Business Continuity and Disaster Recovery in the Cloud

Published 10/31/2021

Written by Megan Theimer, Content Marketing Manager, CSA.

Business Continuity and Disaster Recovery (BC/DR) is just as important for cloud computing as it is for any other technology. However, the cloud brings specific considerations that need to be kept in mind. This blog post provides an overview of how to approach BC/DR in the cloud, including the overarching principle of architecting for failure and considerations for three different scenarios.

How to Approach BC/DR in the Cloud

Like security and compliance, BC/DR is a shared responsibility. There are aspects that the cloud provider has to manage, but the cloud customer is also ultimately responsible for how they use and manage the cloud service.

Additionally, BC/DR must take a risk-based approach. Many BC options can be cost-prohibitive in the cloud, but they may also be unnecessary. For example, the odds of a major IaaS provider going out of business or changing its entire business model are low, whereas neither is all that uncommon for a smaller venture-backed SaaS provider.

Architect for Failure

Cloud platforms can be incredibly resilient, but single cloud assets are typically less resilient than traditional infrastructure. This is due to the inherently greater fragility of virtualized resources running in highly complex environments. As a result, cloud providers tend to offer options to improve resiliency, often beyond what is attainable in traditional infrastructure. For example, you can use multiple “zones” to deploy virtual machines within an auto-scaled group that spans physically distinct data centers for high availability.

This extra resiliency is only achievable if you architect to leverage these capabilities. Deploying your application all in one zone, or even on a single virtual machine in a single zone, is likely to be less resilient than deploying on a single, well-maintained physical server.
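To make the difference concrete, here is a minimal sketch of spreading instances across zones and counting what survives the loss of one zone. The zone names and the round-robin placement are assumptions made for illustration; real platforms handle placement through their own auto-scaling features.

```python
from itertools import cycle


def spread_across_zones(instance_count: int, zones: list[str]) -> list[str]:
    """Assign each instance to a zone in round-robin order."""
    if not zones:
        raise ValueError("at least one zone is required")
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(instance_count)]


def surviving_instances(placement: list[str], failed_zone: str) -> int:
    """Count the instances still running if one zone is lost."""
    return sum(1 for zone in placement if zone != failed_zone)


# Hypothetical zone names, for illustration only.
single_zone = spread_across_zones(6, ["zone-a"])
multi_zone = spread_across_zones(6, ["zone-a", "zone-b", "zone-c"])

print(surviving_instances(single_zone, "zone-a"))  # 0
print(surviving_instances(multi_zone, "zone-a"))   # 4
```

With everything in one zone, a single zone failure takes out the whole application. With three zones, two thirds of the capacity keeps running.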

Remember:

Not all assets need equal continuity.
Don’t drive yourself crazy by planning for full provider outages just because of the perceived loss of control. Look at historical performance.
Strive to design for recovery time objectives (RTOs) and recovery point objectives (RPOs) equivalent to those on traditional infrastructure.
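One way to act on these reminders is to record an RPO per asset and check it against how often a recovery point is actually taken. The sketch below assumes that worst-case data loss equals the backup interval. The asset names and figures are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Asset:
    name: str
    rpo: timedelta              # maximum tolerable data loss
    backup_interval: timedelta  # how often a recovery point is taken


def meets_rpo(asset: Asset) -> bool:
    """Worst-case data loss is one backup interval, so it must not exceed the RPO."""
    return asset.backup_interval <= asset.rpo


# Hypothetical assets: not every asset needs the same level of continuity.
assets = [
    Asset("orders-db", rpo=timedelta(minutes=15), backup_interval=timedelta(minutes=5)),
    Asset("marketing-site", rpo=timedelta(hours=24), backup_interval=timedelta(hours=24)),
    Asset("reporting-store", rpo=timedelta(hours=1), backup_interval=timedelta(hours=6)),
]

for asset in assets:
    status = "ok" if meets_rpo(asset) else "backup interval exceeds RPO"
    print(f"{asset.name}: {status}")
```

An RTO check would follow the same pattern, comparing a measured recovery time against the target.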

Business Continuity Within the Cloud Provider

When you deploy assets into the cloud, you can’t assume the cloud will always be there, or always work the way you expect. A key point is that the very nature of virtualizing resources into pools typically creates less resiliency for any single asset, like a virtual machine. On the other hand, abstracting resources and managing everything through software opens up flexibility to more easily enable resiliency features like durable storage.

There is a huge range of options here, and not all providers or platforms are created equal, but you shouldn’t assume that “the cloud” as a general term is more or less resilient than traditional infrastructure. This is why it is typically best to re-architect deployments when you migrate them to the cloud.

Some points to keep in mind:

Understand and leverage the platform’s BC/DR features before adding on any additional capabilities through third-party tools.
BC/DR must account for the entire logical stack, including metastructure, infrastructure, infostructure, and applistructure.
When real-time switching isn’t possible, design your application to gracefully fail in case of a service outage. There are many automation techniques to support this.
Downtime is always an option. You don’t always need perfect availability, but if you do plan to accept an outage, you should at least ensure you fail gracefully, with emergency downtime notification pages and responses.
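The graceful-failure points above can be sketched as a request handler that tries the primary service, then an optional fallback, and finally returns a downtime notice. The function names, status codes, and message are illustrative assumptions, not any particular framework's API.

```python
from typing import Callable, Optional

# Hypothetical downtime notice shown when nothing else is available.
DOWNTIME_PAGE = "Service temporarily unavailable. Please try again later."


def handle_request(
    primary: Callable[[], str],
    fallback: Optional[Callable[[], str]] = None,
) -> tuple[int, str]:
    """Return (status, body), degrading step by step instead of crashing."""
    try:
        return 200, primary()
    except Exception:
        pass  # primary service is down; try the next option

    if fallback is not None:
        try:
            return 200, fallback()
        except Exception:
            pass  # fallback is down too

    # Planned graceful failure: a clear downtime response.
    return 503, DOWNTIME_PAGE


def broken_service() -> str:
    raise ConnectionError("service outage")


print(handle_request(broken_service, lambda: "cached result"))
print(handle_request(broken_service))
```

A real deployment would also log the failure and trigger the notifications mentioned above. This sketch leaves those out for brevity.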

Business Continuity for Loss of the Cloud Provider

It’s always possible that an entire cloud provider, or at least a major portion of its infrastructure, could go down. Depending on the history of your provider and its internal availability capabilities, accepting this risk is often a legitimate option. Downtime may be another option, but it depends on your RTOs. Be wary of selecting a secondary provider or service that may itself be located on, or reliant on, the same provider.

SaaS is often the biggest provider outage concern, due to total reliance on the provider. Scheduled data extraction and archiving may be your only BC option outside of accepting downtime. Extracting and archiving the data to another cloud service, especially IaaS/PaaS, may be a better option than moving it to local/on-premises storage.
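As a rough sketch of scheduled extraction and archiving, the function below writes one timestamped export per run. The `fetch_records` callable stands in for whatever export mechanism your SaaS provider actually offers, and the local directory stands in for the archive destination. Both are assumptions, and the scheduling itself (for example, a cron job) is left out.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Optional


def archive_export(
    fetch_records: Callable[[], list],
    archive_dir: Path,
    now: Optional[datetime] = None,
) -> Path:
    """Write the current export to a timestamped JSON file and return its path."""
    now = now or datetime.now(timezone.utc)
    archive_dir.mkdir(parents=True, exist_ok=True)
    stamp = now.strftime("%Y%m%dT%H%M%SZ")
    target = archive_dir / f"export-{stamp}.json"
    target.write_text(json.dumps(fetch_records(), indent=2), encoding="utf-8")
    return target


def fetch_example_records() -> list:
    # Hypothetical stand-in for a call to the SaaS provider's export feature.
    return [{"id": 1, "name": "example record"}]
```

Keeping each run as a separate file means an older archive is still available if a later export turns out to be incomplete.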

Business Continuity for Private Cloud

In a private cloud, BC/DR rests completely on the provider’s shoulders. RTOs and RPOs should be stringent, since if the cloud goes down, everything goes down.

If you are providing services to others, be aware of contractual requirements, including data residency, when building your BC plans. For example, failing over to a different geography in a different legal jurisdiction may violate contracts or local laws.
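A residency constraint like this can be checked before a failover target is ever chosen. The sketch below filters candidate regions by the jurisdictions a contract permits. The region names and jurisdiction labels are hypothetical, and the real constraints would come from your contracts and legal review.

```python
def allowed_failover_regions(
    candidates: dict[str, str],
    permitted_jurisdictions: set[str],
) -> list[str]:
    """Return candidate regions whose jurisdiction the contract permits."""
    return sorted(
        region
        for region, jurisdiction in candidates.items()
        if jurisdiction in permitted_jurisdictions
    )


# Hypothetical mapping of failover regions to legal jurisdictions.
candidate_regions = {
    "region-eu-1": "EU",
    "region-eu-2": "EU",
    "region-us-1": "US",
}

print(allowed_failover_regions(candidate_regions, {"EU"}))
# ['region-eu-1', 'region-eu-2']
```

If the filter returns an empty list, the BC plan has no compliant failover target and needs to be revisited.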

Summary of Recommendations

Architect for failure.
Take a risk-based approach to everything.
Design for high availability within your cloud provider.
Prepare for graceful failure in case of a cloud provider outage.
Cloud providers, including private cloud providers, must provide the highest levels of availability and give users mechanisms to manage aspects of their own availability.

The information in this blog was taken from CSA’s Security Guidance for Critical Areas of Focus in Cloud Computing. Read the Security Guidance for an in-depth look at cloud security best practices. For more details about cloud disaster recovery, you can read our publication Disaster Recovery as a Service.
