7 Best Practices for Cloud Incident Response
Published 10/18/2022
Originally published by Mitiga.
Written by Matthew Stephen, Mitiga.
You may have heard the saying that it is not a matter of “if” but “when” you will experience a breach. An attack could be targeted or opportunistic, performed by a nation-state or a less sophisticated threat actor, focused on theft of intellectual property or on ransomware, and started with a zero-day exploit or a phishing email (and these are not necessarily mutually exclusive). Regardless of the specific details, organizations must be prepared to respond to a breach when one occurs. The more applications and services organizations move to the cloud, the more important it is to plan for cloud incident response. These seven best practices will help you get started.
1. Understand the differences between cloud and on-premises environments as well as the shared responsibility model
With a traditional on-premises environment, organizations typically deploy antivirus and/or endpoint detection and response (EDR) tools on every system. With infrastructure as a service (IaaS), such as AWS EC2, that method may still be used. However, with platform, software, or function as a service (PaaS, SaaS, or FaaS), such as AWS EKS, RDS, or Lambda, the cloud provider manages the underlying system, so installing endpoint software is not feasible. In those instances, organizations must manage detection and response through other means.
Additionally, cloud providers share responsibility for security within customer cloud environments. For example, with IaaS, the cloud provider is responsible for securing physical hosts and networks. With PaaS and SaaS, the cloud provider is responsible for securing the operating system and application, respectively. The customer is responsible for securing their own information and identities. As AWS describes it, the cloud provider is responsible for security “of” the cloud and the customer is responsible for security “in” the cloud. Some cloud providers may also provide assistance, such as collecting logs or providing recommendations, during an incident involving their services.
2. Use the principle of least privilege and zero trust
The principle of least privilege states that a user or resource should be given only the minimum privileges required to complete its task. For example, if a user needs only read-only access to a single application, that user should not be granted write or administrative privileges to that application, or any access to other applications. When an organization creates a new cloud account, there is typically a “root” user that has full access to everything. That user should be locked down with multi-factor authentication and not used for everyday activity. All other users should be provisioned according to the principle of least privilege to limit the risk to the organization if a user’s credentials are compromised.
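As a rough illustration of auditing for least privilege, the following Python sketch flags Allow statements in an IAM-style policy document that grant wildcard actions or resources. The function names and the two sample policies are hypothetical and exist only for this example; a real review would also need to consider conditions, NotAction, permission boundaries, and other policy features.

```python
from __future__ import annotations

from typing import Any


def _as_list(value: Any) -> list:
    """IAM policies allow a single string or a list for Action and Resource."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def find_overbroad_statements(policy: dict) -> list[dict]:
    """Return Allow statements that use a wildcard action or wildcard resource."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may not be wrapped in a list
        statements = [statements]

    flagged = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        actions = _as_list(stmt.get("Action"))
        resources = _as_list(stmt.get("Resource"))
        wildcard_action = any(a == "*" or a.endswith(":*") for a in actions)
        wildcard_resource = "*" in resources
        if wildcard_action or wildcard_resource:
            flagged.append(stmt)
    return flagged


# Hypothetical policies for illustration only.
read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-app-bucket/*",
        }
    ],
}

admin_policy = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "*", "Resource": "*"}],
}

print(len(find_overbroad_statements(read_only_policy)))
print(len(find_overbroad_statements(admin_policy)))
```

The check is deliberately conservative: it reports any statement worth a second look rather than deciding whether the access is justified.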
The same principle should be applied to resources. For example, databases and other critical systems generally should not be publicly accessible from the Internet, and it is rarely necessary to allow SSH or RDP directly between every system. Instead, a database should be accessible only from the application that uses it, and SSH and RDP access can go through secure jump boxes rather than direct connections from every system in the environment.
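One way to check for this kind of exposure is to scan security group rules for SSH or RDP open to the whole Internet. The sketch below works on a dictionary shaped like one entry in the SecurityGroups list returned by the EC2 DescribeSecurityGroups API; the group ID and rules are made up for illustration, and the helper name is an assumption rather than part of any AWS tooling.

```python
ADMIN_PORTS = {22: "SSH", 3389: "RDP"}
OPEN_CIDRS = {"0.0.0.0/0", "::/0"}


def open_admin_ports(group: dict) -> list:
    """List admin ports that a security group exposes to any address."""
    findings = []
    for perm in group.get("IpPermissions", []):
        protocol = perm.get("IpProtocol")
        if protocol not in ("tcp", "6", "-1"):
            continue  # only TCP rules, or "-1" meaning all protocols

        cidrs = [r.get("CidrIp") for r in perm.get("IpRanges", [])]
        cidrs += [r.get("CidrIpv6") for r in perm.get("Ipv6Ranges", [])]
        if not OPEN_CIDRS.intersection(cidrs):
            continue  # rule is restricted to specific networks

        if protocol == "-1":
            low, high = 0, 65535  # all-protocol rules carry no port range
        else:
            low, high = perm.get("FromPort", 0), perm.get("ToPort", 65535)

        for port, name in ADMIN_PORTS.items():
            if low <= port <= high:
                findings.append(
                    f"{group['GroupId']}: {name} ({port}) open to the Internet"
                )
    return findings


# Hypothetical security group for illustration only.
example_group = {
    "GroupId": "sg-0example",
    "IpPermissions": [
        {
            "IpProtocol": "tcp",
            "FromPort": 22,
            "ToPort": 22,
            "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
        },
        {
            "IpProtocol": "tcp",
            "FromPort": 5432,
            "ToPort": 5432,
            "IpRanges": [{"CidrIp": "10.0.1.0/24"}],
        },
    ],
}

for finding in open_admin_ports(example_group):
    print(finding)
```

In this sample, the database rule is limited to an internal subnet and passes, while the SSH rule is reported.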
Zero trust is an extension of the principle of least privilege that, according to Microsoft, also explicitly and continuously verifies every transaction, making decisions based on all available data points, including identity, location, device health, resource, data classification, and anomalies. Additionally, a zero trust architecture assumes a breach has already occurred and works to minimize the blast radius with micro-segmentation, end-to-end encryption, continuous monitoring, and automated threat detection and response.
3. Configure, centralize, and secure logs
For systems not managed with traditional security tools, such as EDR, logs are critical. Understand which logs are enabled by default and which must be explicitly enabled. For example, AWS CloudTrail records management events but not data events by default. Additionally, VPC Flow Logs and S3 server access logs are not enabled by default but can provide additional visibility into the environment. According to IBM, it took organizations an average of 287 days to identify and contain a breach. For non-ransomware breaches, the average is likely higher. Because an investigation may need to reach back that far or further, ensure log retention is configured for a sufficient period, likely one year or longer, so logs are still available when you need them.
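The sketch below shows, under stated assumptions, what turning on S3 data events for a CloudTrail trail and sanity-checking retention might look like in Python. The helper names and the 365-day default are illustrative choices based on the guidance above, not AWS defaults. The call to put_event_selectors is an existing boto3 CloudTrail client method, but it is wrapped in a function that is not invoked here because it requires boto3, credentials, and an existing trail.

```python
from __future__ import annotations


def build_event_selectors(log_s3_data_events: bool = True) -> list[dict]:
    """Build CloudTrail event selectors that keep management events on and
    optionally add S3 object-level data events."""
    selector = {
        "ReadWriteType": "All",
        "IncludeManagementEvents": True,
        "DataResources": [],
    }
    if log_s3_data_events:
        # The "arn:aws:s3" prefix is intended to cover objects in all buckets
        # in the account; narrow it to specific bucket ARNs to limit volume.
        selector["DataResources"].append(
            {"Type": "AWS::S3::Object", "Values": ["arn:aws:s3"]}
        )
    return [selector]


def retention_is_sufficient(retention_days: int, minimum_days: int = 365) -> bool:
    """Check a retention setting against a chosen minimum (assumed: one year)."""
    return retention_days >= minimum_days


def apply_selectors(trail_name: str) -> None:
    """Apply the selectors to an existing trail (not called in this sketch)."""
    import boto3  # imported lazily so the rest of the example runs without it

    cloudtrail = boto3.client("cloudtrail")
    cloudtrail.put_event_selectors(
        TrailName=trail_name,
        EventSelectors=build_event_selectors(),
    )


print(build_event_selectors())
print(retention_is_sufficient(90))
print(retention_is_sufficient(400))
```

Data events can be high volume and are billed separately, so many teams scope them to the buckets that hold sensitive data rather than enabling them everywhere.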
Logs should be centralized to ensure they are readily available in the event of an incident. For example, AWS supports creating an organization trail for all AWS accounts in the AWS organization. Even without configuring an organization trail, AWS supports storing logs in S3 buckets. These logs should be stored securely to prevent accidental or malicious deletion. Logs will not do much good if a threat actor deletes them before they are reviewed.
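As one possible way to guard a central log bucket against deletion, the following sketch generates an S3 bucket policy that denies object deletion for all principals. The bucket name and function name are hypothetical. This is a simplified illustration: a real deployment would likely add carefully scoped exceptions, and may prefer features such as S3 Object Lock or versioning with MFA delete.

```python
import json


def build_log_protection_policy(bucket_name: str) -> str:
    """Return a bucket policy (as JSON) that denies deleting log objects."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyLogDeletion",
                "Effect": "Deny",
                "Principal": "*",
                "Action": ["s3:DeleteObject", "s3:DeleteObjectVersion"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            }
        ],
    }
    return json.dumps(policy, indent=2)


# Hypothetical bucket name for illustration only.
policy_json = build_log_protection_policy("example-central-logs")
print(policy_json)
```

The resulting JSON could then be attached to the bucket, for example through the console or the S3 PutBucketPolicy API, after review by whoever owns the logging account.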
4. Take advantage of built-in monitoring and security tools
Major cloud providers like AWS offer cloud-native tools, such as GuardDuty and CloudWatch, for proactive threat detection. Because a major cloud provider may serve thousands or even more than a million customers, it benefits from more visibility, expertise, and intelligence than most other organizations have. Whatever types of threats are targeting your organization, the cloud provider has likely seen similar threats in other customer environments. Those tools can provide quick wins in identifying malicious or suspicious activity.
5. Know how to preserve evidence
When responding to an incident, the top priority might be minimizing business impact or fulfilling legal reporting obligations, but preserving evidence is important to ensure it is not lost or altered prior to analysis. In legacy on-premises forensics, this typically involves creating forensic images of disks from compromised systems while maintaining the chain of custody, which is a process that tracks the movement of evidence through its collection, safeguarding, and analysis lifecycle. In AWS, evidence preservation might include storing EC2 instance metadata, EBS disk snapshots, CloudTrail logs, VPC Flow logs, and other logs or artifacts.
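To make the chain-of-custody idea concrete, here is a minimal Python sketch that records a SHA-256 hash when an artifact is collected, logs each handling step with a UTC timestamp, and later verifies that the artifact has not changed. The class names, handler names, and sample artifact are all hypothetical, and a real process would also need tamper-resistant storage for the record itself.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CustodyEntry:
    timestamp: str
    handler: str
    action: str


@dataclass
class EvidenceItem:
    name: str
    sha256: str
    custody: list[CustodyEntry] = field(default_factory=list)

    def log(self, handler: str, action: str) -> None:
        """Append a timestamped custody entry for this item."""
        now = datetime.now(timezone.utc).isoformat()
        self.custody.append(CustodyEntry(now, handler, action))


def collect(name: str, data: bytes, handler: str) -> EvidenceItem:
    """Hash the artifact at collection time and open its custody log."""
    item = EvidenceItem(name=name, sha256=hashlib.sha256(data).hexdigest())
    item.log(handler, "collected")
    return item


def verify(item: EvidenceItem, data: bytes) -> bool:
    """Confirm the artifact still matches the hash recorded at collection."""
    return hashlib.sha256(data).hexdigest() == item.sha256


# Hypothetical artifact and handlers for illustration only.
artifact = b'{"InstanceId": "i-0example", "State": "running"}'
item = collect("ec2-instance-metadata.json", artifact, "analyst-a")
item.log("analyst-b", "received for analysis")

print(verify(item, artifact))
print(verify(item, artifact + b" "))
for entry in item.custody:
    print(entry.handler, entry.action)
```

The same pattern applies to larger artifacts such as exported logs, although those are better hashed in chunks than loaded into memory at once.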
6. Test incident response processes regularly
An incident response plan (IRP) provides a set of tools to use or processes to follow in the event of an incident. However, in organizations that have never experienced a significant incident, the IRP may never have been tested to confirm it still makes sense and aligns with business, compliance, and regulatory requirements. The IRP may also focus on the traditional on-premises environment and not the cloud. In the absence of an actual incident, red team, purple team, and blue team exercises, along with tabletop exercises, can help test the IRP and identify gaps. If gaps are identified, be sure to make the necessary updates to the plan itself.
7. Augment internal capabilities with a cloud incident response partner
Depending on its size, an organization may or may not have a dedicated IR team. Even a dedicated team may already be fully stretched with day-to-day activities when a significant incident occurs. As with cloud providers, a cloud IR partner likely has visibility into more environments and incidents than an individual organization does. As such, a cloud IR partner can help improve readiness before an incident occurs and minimize the impact of an incident when it occurs.