AWS Disaster Recovery: 8 Best Practices, Architecture, Types

Worried about hardware or software failures, or a slew of other disaster events affecting your infrastructure? Read on to learn about AWS Disaster Recovery.

All organizations that have applications and systems in a data center are under constant threat of data loss that might result from natural disasters, software errors, or hardware malfunction.

Events like these are sometimes inevitable and any amount of data loss can have an adverse impact on the company’s reputation and finances.

This is the reason companies should always have a disaster recovery plan in place to minimize losses.

Thankfully, AWS helps businesses create a solid, targeted, well-tested, and cost-effective disaster recovery plan.

This AWS tutorial takes a detailed look at AWS disaster recovery planning and its architecture. Let’s begin with what AWS disaster recovery is.

What Is AWS Disaster Recovery?

Amazon Web Services (AWS) enables you to set up disaster recovery, both for workloads deployed in the AWS Cloud and for on-premises services.

It offers four disaster recovery strategies to allow you to create replicas and backups to tackle disaster events.

These range from the low cost and low complexity of making backups to more complex approaches that rely on multiple active Regions.

It’s crucial to review and test your chosen disaster recovery strategy regularly so that when it’s time to invoke it, you have the confidence to do so. Let’s go into the AWS disaster recovery strategies:

AWS Disaster Recovery Strategies

The four basic disaster recovery strategies offered by AWS include:

1. AWS Disaster Recovery Backup And Restore

Also referred to as the Cold Method, backup and restore involves periodically backing up your systems and storing the backups off-site (traditionally on tape shipped to another location). It’s a suitable approach for mitigating data loss or corruption.

The strategy can also mitigate regional disasters, provided you replicate the data to other AWS Regions.

You may also use this strategy to address a lack of redundancy for workloads deployed to a single Availability Zone. In addition to the data, you need to redeploy the application code, infrastructure, and configuration in the recovery Region.

To ensure that the infrastructure is redeployed quickly and without errors, consider deploying it as Infrastructure as Code (IaC) using AWS CloudFormation, the AWS Cloud Development Kit (CDK), or a similar tool.

Restoring workloads in a recovery Region can be complex without IaC, in which case the recovery times can be high and even exceed your RTO.

You shouldn’t just back up the user data, but also the configuration and application code, including the Amazon Machine Images (AMIs) that are used to create Amazon EC2 instances.

To automate the redeployment of configuration and application code, consider using AWS CodePipeline.

2. AWS Disaster Recovery Pilot Light

With the Pilot Light strategy, the data is mirrored, a minimal version of the system keeps running in a different region, and the environment is scripted as a template.

Data is replicated from one Region to another and a copy of the core workload infrastructure is provisioned.

The idea is taken from the gas heater, in which a small flame stays on and when required, will immediately ignite the entire furnace.

The main component of the system, typically the database, remains activated at all times for data replication, while server images are developed and updated periodically for other layers.

When a disaster occurs, the backed-up Amazon machine images (AMIs) are used to build out and scale the environment around the pilot light.

The strategy not only reduces the RPO and RTO but also provides the ease of just turning on the resources, so recovery typically takes tens of minutes rather than the hours needed for a full restore.

To automate the provisioning of services, AWS CloudFormation can be used.

Unlike the case with the backup and restore strategy, the core infrastructure is always available with the Pilot Light strategy and you can switch on and scale your application servers to promptly provide a full-scale production environment.

The trade-off is that costs are higher than with backup and restore, and there are additional overheads for configuring, testing, and patching the services so that they stay in step with the production environment.

3. AWS Disaster Recovery Warm Standby

This strategy involves having a scaled-down, yet fully functional copy of your production environment in a different region.

Since your workload is always running in a different Region, this approach is an extension of the Pilot Light concept that further decreases the recovery time.

With this strategy, it’s also easier to run tests, or even test continuously, to build your confidence in recovering from disasters.

Like Pilot Light, Warm Standby includes an environment in your disaster recovery region with copies of assets in your primary Region.

Hence, the difference between the two approaches can be difficult to grasp. Keep in mind that Warm Standby can immediately handle traffic at low capacity levels, while pilot light can’t process requests without extra steps taken first.

Moreover, Warm Standby requires you only to scale up, because everything has already been deployed and is running, while Pilot Light requires you to “switch on” servers, scale up, and perhaps deploy additional infrastructure.

To help you choose between Pilot Light and Warm Standby, examine your RPO and RTO needs.
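To make the Warm Standby recovery step concrete, here is a minimal sketch of the "scale up only" action. It assumes the standby fleet runs in an Amazon EC2 Auto Scaling group and that you pass in a boto3 `autoscaling` client; the group name `dr-web-asg`, the capacity numbers, and the `DryRunClient` stand-in are hypothetical and exist only so the example runs without AWS credentials.

```python
class DryRunClient:
    """Stand-in for a boto3 client: records calls instead of sending them."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Any method name is accepted and logged with its keyword arguments.
        def record(**kwargs):
            self.calls.append((name, kwargs))
            return {}
        return record


def scale_up_warm_standby(autoscaling_client, group_name, desired, max_size):
    """Grow a scaled-down standby Auto Scaling group to production capacity."""
    if desired > max_size:
        raise ValueError("desired capacity cannot exceed max_size")
    autoscaling_client.update_auto_scaling_group(
        AutoScalingGroupName=group_name,
        MinSize=desired,
        MaxSize=max_size,
        DesiredCapacity=desired,
    )


# Dry run with hypothetical values; swap in a real client for an actual failover.
client = DryRunClient()
scale_up_warm_standby(client, "dr-web-asg", desired=6, max_size=12)
print(client.calls)
```

With Pilot Light, a comparable script would first have to launch or start the application servers, which is where the extra recovery time comes from.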

4. AWS Disaster Recovery Multi-Site Active/Active

The Multi-Site Active/Active strategy allows you to run your workload concurrently in different Regions.

With this approach, users can access your workload in every Region where it’s deployed.

While it’s the most costly and complex approach to disaster recovery, with the right technology choices and deployment, it can lower your recovery time to almost zero for most disasters.

It’s important to distinguish this from the Hot Standby concept, which relies on an active/passive configuration: users are directed to a single Region, and the disaster recovery Regions take no traffic until a failover. Multi-site active/active, by contrast, serves traffic from every Region where it’s deployed.

Now that you know what AWS disaster recovery is and its four main strategies, let’s study some of the best practices for it:

AWS Disaster Recovery Best Practices

If you wish to minimize the recovery time and maximize reliability in the event of a disaster, follow these best practices:

1. Pick the Right Recovery Strategy

The first step is to choose the right strategy for recovering failed workloads in AWS. Your decision will depend on your recovery needs and budget.

If your disaster event is based on the loss or disruption of one physical data center for a highly available, well-architected workload, you should be covered by the backup and restore approach.

But a Multi-Site Active/Active, Warm Standby, or Pilot Light strategy is more appropriate if the disaster extends to the loss or disruption of a whole Region, or if regulatory requirements demand it.

If your budget for AWS disaster recovery is substantial, invest in a multi-site active/active or warm standby strategy.

Also, you may even use more than one recovery technique simultaneously. For example, for a non-critical workload that can tolerate some downtime, a backup and restore approach should be suitable.

But for workloads that must be restored in the quickest possible time, you can use the multi-site active/active or warm standby method.

Try to achieve a combination that strikes the right balance between recovery performance and the cost of recovery.
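The selection logic above can be sketched as a small lookup. This is an illustration, not an AWS tool: the minute thresholds are assumptions chosen to mirror rough recovery bands (hours for backup and restore, tens of minutes for pilot light, minutes for warm standby, near zero for active/active), and you should replace them with figures measured in your own drills.

```python
# (strategy, assumed achievable RTO/RPO in minutes), ordered cheapest first.
# The numbers are illustrative assumptions, not AWS guarantees.
STRATEGIES = [
    ("Backup and Restore", 240),
    ("Pilot Light", 30),
    ("Warm Standby", 5),
    ("Multi-Site Active/Active", 0),
]


def pick_strategy(rto_minutes, rpo_minutes):
    """Return the cheapest strategy whose assumed band fits both targets."""
    tightest = min(rto_minutes, rpo_minutes)
    for name, achievable in STRATEGIES:
        if achievable <= tightest:
            return name
    # Nothing fits (for example a negative target): fall back to the strongest.
    return STRATEGIES[-1][0]


print(pick_strategy(rto_minutes=24 * 60, rpo_minutes=24 * 60))
print(pick_strategy(rto_minutes=60, rpo_minutes=15))
```

Because the list is ordered by cost, the function never recommends a more expensive strategy than the targets require, which is the balance this tip is after.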

2. Determine the RTO and RPO

RTO (Recovery Time Objective) refers to the maximum amount of time important systems can be down before the outage poses a serious threat to the business, while RPO (Recovery Point Objective) refers to the maximum amount of data loss that can be tolerated after an outage without a serious impact on the business.

Like RTO, RPO is measured in time. An RPO of 24 hours, for example, means the business can tolerate losing up to the last 24 hours of data. Different businesses have different requirements for RTO and RPO.

To determine your recovery needs, list the workloads on which your business depends and then classify them based on their significance to the organization.
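As a rough sketch of that exercise, the snippet below lists workloads, orders them by criticality, and checks whether each backup schedule can meet its RPO. The workload names, the tier numbering (1 = most critical), and the simplification that worst-case data loss equals one full backup interval are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    tier: int                      # 1 = most critical (hypothetical convention)
    rpo_minutes: int               # maximum tolerable data loss
    backup_interval_minutes: int   # time between recovery points


def meets_rpo(workload):
    # Worst case, a failure happens just before the next backup,
    # so up to one full interval of data is lost.
    return workload.backup_interval_minutes <= workload.rpo_minutes


def review(workloads):
    """Return (name, meets_rpo) pairs, most critical workloads first."""
    ordered = sorted(workloads, key=lambda w: w.tier)
    return [(w.name, meets_rpo(w)) for w in ordered]


inventory = [
    Workload("reports", tier=3, rpo_minutes=1440, backup_interval_minutes=1440),
    Workload("orders-db", tier=1, rpo_minutes=15, backup_interval_minutes=60),
]
print(review(inventory))
```

A workload that fails the check needs either more frequent backups or a strategy with continuous replication.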

3. Develop a Backup Plan for EC2

According to AWS, you should use either EBS snapshots or AMI backups to back up virtual machines running in Amazon EC2.

The former is a point-in-time snapshot of an EBS volume, capturing the data stored on a volume attached to an EC2 instance. The latter is an Amazon Machine Image (AMI), which contains everything needed to rebuild an EC2 instance that has failed.
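Here is a minimal sketch of both options using the EC2 API calls `create_snapshot` and `create_image`. It assumes you pass in a boto3 EC2 client; the description text and the `dr-` naming scheme are hypothetical choices, not AWS conventions.

```python
from datetime import datetime, timezone


def snapshot_volume(ec2_client, volume_id):
    """Take a point-in-time EBS snapshot of one volume; returns the snapshot ID."""
    response = ec2_client.create_snapshot(
        VolumeId=volume_id,
        Description=f"DR backup of {volume_id}",
    )
    return response["SnapshotId"]


def create_ami(ec2_client, instance_id, now=None):
    """Create an AMI from an instance; returns the image ID."""
    now = now or datetime.now(timezone.utc)
    response = ec2_client.create_image(
        InstanceId=instance_id,
        # The timestamp keeps the name unique and avoids characters such as ':'.
        Name=f"dr-{instance_id}-{now:%Y%m%d-%H%M%S}",
        # NoReboot=True avoids downtime, but file system consistency
        # of the resulting image is then not guaranteed.
        NoReboot=True,
    )
    return response["ImageId"]
```

With boto3 installed and credentials configured, `ec2_client` would typically be `boto3.client("ec2")`. Passing the client in as a parameter also makes the functions easy to exercise with a stub during DR drills.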

4. Leverage AWS Backup Automation

To make the recovery process as fast as possible, use a disaster recovery plan that leverages automation.

AWS Backup is one such tool. It provides a policy-based, centralized way to manage backup and recovery operations for various AWS resources.

It covers many major AWS services, including EC2 instances and most databases, but not every service. Resources are backed up automatically according to the policies you define, and restores can be started from the same tool.
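To show what such a policy looks like, the sketch below builds a backup plan as a plain Python dictionary. The field names follow the AWS Backup `CreateBackupPlan` request as we understand it, so verify them against the current API reference before use; the plan name, vault names, account ID, and retention period are hypothetical.

```python
def build_backup_plan(name, vault, schedule_cron, retain_days, copy_vault_arn=None):
    """Build a backup plan document with one scheduled rule."""
    rule = {
        "RuleName": f"{name}-rule",
        "TargetBackupVaultName": vault,
        "ScheduleExpression": schedule_cron,
        "Lifecycle": {"DeleteAfterDays": retain_days},
    }
    if copy_vault_arn:
        # Optional cross-Region copy so a recovery point survives a Regional outage.
        rule["CopyActions"] = [
            {
                "DestinationBackupVaultArn": copy_vault_arn,
                "Lifecycle": {"DeleteAfterDays": retain_days},
            }
        ]
    return {"BackupPlanName": name, "Rules": [rule]}


plan = build_backup_plan(
    name="daily-dr",
    vault="primary-vault",
    schedule_cron="cron(0 5 ? * * *)",  # every day at 05:00 UTC
    retain_days=35,
    copy_vault_arn="arn:aws:backup:us-west-2:111122223333:backup-vault:dr-vault",
)
print(plan)
```

The dictionary is only built here, not submitted. With boto3, it would be passed as the `BackupPlan` argument of the `backup` client's `create_backup_plan` call.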

5. Test and Maintain Every AWS Recovery Plan

The AWS recovery plans you use should be tested regularly. Simulation is the simplest way of doing that.

This can involve creating a scenario in which an important workload fails and then recovering it by executing your plan with the backups available.

Conduct the drills several times every year and more frequently for important workloads.
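One way to make each drill measurable is to record three timestamps and compare the outcome with your targets. The sketch below is an assumption about how you might score a drill, not part of any AWS service, and the times used are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrillResult:
    outage_start: datetime         # when the workload was taken down
    service_restored: datetime     # when it served traffic again
    last_recovery_point: datetime  # newest backup or replica available


def evaluate_drill(result, rto, rpo):
    """Compare achieved recovery time and data loss with the targets."""
    achieved_rto = result.service_restored - result.outage_start
    achieved_rpo = result.outage_start - result.last_recovery_point
    return {
        "achieved_rto": achieved_rto,
        "achieved_rpo": achieved_rpo,
        "rto_met": achieved_rto <= rto,
        "rpo_met": achieved_rpo <= rpo,
    }


drill = DrillResult(
    outage_start=datetime(2024, 3, 1, 10, 0),
    service_restored=datetime(2024, 3, 1, 10, 45),
    last_recovery_point=datetime(2024, 3, 1, 9, 30),
)
print(evaluate_drill(drill, rto=timedelta(hours=1), rpo=timedelta(minutes=15)))
```

In this made-up drill the service came back within the one-hour RTO, but 30 minutes of data would have been lost against a 15-minute RPO, which signals that the backup or replication interval needs tightening.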

AWS disaster recovery types

Let us try to find out the types of DR we can configure.

Cloud DR – Companies generally keep an exact replica of their production environment in another Region as their DR setup.
DCDR – This is a disaster recovery setup for private Data Centres.
Network DR – Networking is an integral part of any system, so we must have a strategy in place to restore it in the DR site.

AWS disaster recovery testing

To test whether the DR setup works, a simulation is generally performed in the QA environment: the main environment is brought down manually, and the team verifies that the DR site comes up as planned with only the expected manual intervention.

These tests are repeated several times so that the recovery runs smoothly when an actual disaster occurs.

AWS disaster recovery architecture

AWS disaster recovery white paper

This is the whitepaper published by AWS related to disaster recovery setup.

AWS disaster recovery Pricing?

Pricing of the DR setup depends on the production site it protects. The services used and the way we have planned the disaster site together determine the price we incur for the DR site.

Suppose we have an RDS database on the production site. There are two ways to set up DR for it. One is to keep a running RDS instance in the DR site; the other is to take regular backups and use them to restore the database if a disaster occurs. The second option costs less, at the price of a longer recovery.

So, to sum up, it all depends on the way you have planned the DR setup.

AWS disaster recovery Services?

AWS provides a service to manage DR in a better way, called AWS Elastic Disaster Recovery.

It offers a reliable way to manage DR, with less downtime, minimal data loss, and better RPO and RTO.

AWS disaster recovery Options?

There are a few ways that we can configure the DR site.

Active/Active – In this setup, we have an exact replica of production running live, so there are effectively two live sites. This is the costliest DR setup, but RPO and RTO are negligible.
Backup/Restore – Here we keep a backup of production and restore it in the DR site when the main site goes down. This is the cheapest setup, but RPO and RTO are the highest.
Warm Standby – In this case, we keep a scaled-down version of the environment, running the mission-critical applications, active in the DR site. The RPO and RTO are higher than with Active/Active, but the cost is lower.
Pilot Light – In this case, data sync stays active but the application services are switched off until needed. The RPO and RTO are higher than with Warm Standby but better than with Backup and Restore; likewise, the cost is lower than Warm Standby but higher than Backup and Restore.

Conclusion

By now, you should have developed a fair understanding of AWS disaster recovery and its architecture, including what it is, the AWS disaster recovery strategies out there, and some AWS disaster recovery best practices for approaching a disaster event.

We truly hope that this guide proves to be your ultimate answer to approaching AWS disaster recovery.


Steve

I am an Amazon Web Services professional with more than 11 years of experience in AWS and other technologies. I work extensively with AWS tools like S3, Lambda, API, Kinesis, Load Balancers, EKS, ECS, and many more. As a Solution Architect and Technology Lead, I architect and implement solutions for different clients around the world, especially in countries like the United States, Canada, the United Kingdom, Australia, and New Zealand.