AWS Disaster Recovery: 8 Best Practices, Architecture, Types

Worried about hardware or software failures, or any of the other disaster events that can hit your infrastructure? Read on to learn about AWS Disaster Recovery.

All organizations that have applications and systems in a data center are under constant threat of data loss that might result from natural disasters, software errors, or hardware malfunction.

Events like these are sometimes inevitable and any amount of data loss can have an adverse impact on the company’s reputation and finances.

This is the reason companies should always have a disaster recovery plan in place to minimize losses.

Thankfully, AWS helps businesses create a solid, targeted, well-tested, and cost-effective disaster recovery plan.

This AWS tutorial takes a detailed look at AWS disaster recovery planning and its architecture. Let’s begin with the basics: what is AWS disaster recovery?

What Is AWS Disaster Recovery?

Amazon Web Services (AWS) enables you to set up disaster recovery, both for workloads deployed in the AWS Cloud and for on-premises services.

It offers four disaster recovery strategies to allow you to create replicas and backups to tackle disaster events.

These range from low complexity and low cost of making backups to more complex approaches relying on multiple active Regions.

It’s crucial to review and test your chosen disaster recovery strategy so that when it’s time to invoke it, you have the confidence to do so. Let’s go through the AWS disaster recovery strategies:

AWS Disaster Recovery Strategies

The four basic disaster recovery strategies offered by AWS include:

1. AWS Disaster Recovery Backup And Restore

Also referred to as the cold method, backup and restore involves periodically backing up your systems and storing the backups off-site (traditionally on tape). It’s a suitable approach to mitigate data loss or corruption.

The strategy can also mitigate regional disasters by replicating data to other AWS Regions.

You may also use this strategy to address a lack of redundancy for workloads deployed to a single Availability Zone. In addition to the data, you need to redeploy the application code, infrastructure, and configuration in the recovery Region.

To ensure that the infrastructure is redeployed quickly and without errors, consider deploying with infrastructure as code (IaC) using AWS CloudFormation, the AWS Cloud Development Kit (CDK), or a similar service.

Without IaC, restoring workloads in a recovery Region can be complex, and recovery times can grow long enough to exceed your recovery time objective (RTO).
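To make the IaC idea concrete, here is a minimal sketch in Python that generates a small AWS CloudFormation template. The resource name `AppServer`, the default instance type, and the single-instance layout are assumptions made purely for illustration; a real workload would define far more. The AMI ID is left as a template parameter because AMI IDs differ from Region to Region.

```python
import json


def build_recovery_template(instance_type: str = "t3.micro") -> str:
    """Return a minimal CloudFormation template (as JSON) for one app server.

    The AMI ID is a parameter, so the same template can be deployed in the
    primary Region or the recovery Region with a Region-specific image.
    """
    template = {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Hypothetical single-instance workload for DR redeployment",
        "Parameters": {
            "ImageId": {"Type": "AWS::EC2::Image::Id"},
        },
        "Resources": {
            # "AppServer" is an illustrative logical name, not from the article.
            "AppServer": {
                "Type": "AWS::EC2::Instance",
                "Properties": {
                    "ImageId": {"Ref": "ImageId"},
                    "InstanceType": instance_type,
                },
            },
        },
    }
    return json.dumps(template, indent=2)
```

Calling `print(build_recovery_template())` shows the JSON you could save to a file and deploy as a stack in the recovery Region. Keeping the template in version control means the recovery environment is rebuilt from the same definition every time.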

You shouldn’t just back up the user data, but also the configuration and application code, including the Amazon Machine Images (AMIs) that are used to create Amazon EC2 instances.

To automate the redeployment of configuration and application code, consider using AWS CodePipeline.

2. AWS Disaster Recovery Pilot Light

With the Pilot Light strategy, the data is mirrored, a minimal version of the system keeps running in a different region, and the environment is scripted as a template. 

Data is replicated from one Region to another and a copy of the core workload infrastructure is provisioned.

The name comes from gas heaters, in which a small flame stays lit and can quickly ignite the entire furnace when required.

The main component of the system, typically the database, remains activated at all times for data replication, while server images are developed and updated periodically for other layers.

When a disaster occurs, the backed-up Amazon Machine Images (AMIs) are used to build out and scale the environment around the pilot light.

Compared with backup and restore, the strategy not only reduces the RPO and RTO but also lets you recover by simply turning on resources, so recovery typically takes tens of minutes rather than hours.

To automate the provisioning of services, AWS CloudFormation can be used.

Unlike the case with the backup and restore strategy, the core infrastructure is always available with the Pilot Light strategy and you can switch on and scale your application servers to promptly provide a full-scale production environment.

The costs are higher than with backup and restore, and there are additional overheads for configuring, testing, and patching the services so that they stay aligned with the production environment.

3. AWS Disaster Recovery Warm Standby

This strategy involves running a scaled-down yet fully functional copy of your production environment in a different Region.

Since your workload is always on in a different region, this approach is an extension of the Pilot Light concept that further decreases the recovery time.

With this strategy, it’s also easier to run tests, or even test continuously, to build your confidence in recovering from disasters.

Like Pilot Light, Warm Standby includes an environment in your disaster recovery region with copies of assets in your primary Region.

Because the two approaches look alike, the difference can be difficult to grasp. Keep in mind that Warm Standby can handle traffic immediately, at reduced capacity, while Pilot Light can’t process requests until extra steps are taken.

Moreover, Warm Standby only requires you to scale up, because everything has already been deployed and is running, while Pilot Light requires you to “switch on” servers, scale up, and perhaps deploy additional infrastructure.

To help you choose between Pilot Light and Warm Standby, examine your RPO and RTO needs.
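The difference between the two strategies can be sketched as a small Python model. Everything here (the `StandbyEnvironment` type, its fields, and the step names) is hypothetical and only mirrors the description above; it does not call any AWS API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StandbyEnvironment:
    name: str                # label only, e.g. "pilot_light" or "warm_standby"
    running_servers: int     # application servers already running in the DR Region
    production_servers: int  # servers needed to carry full production load


def failover_steps(env: StandbyEnvironment) -> list[str]:
    """List the steps needed before the DR Region carries full production load."""
    steps = []
    if env.running_servers == 0:
        # Pilot Light: data is replicated, but no application servers are running.
        steps.append("switch on application servers")
    if env.running_servers < env.production_servers:
        steps.append(f"scale out to {env.production_servers} servers")
    steps.append("redirect traffic to the recovery Region")
    return steps


def can_serve_immediately(env: StandbyEnvironment) -> bool:
    """Warm Standby can take (reduced) traffic right away; Pilot Light cannot."""
    return env.running_servers > 0
```

For a Pilot Light environment with zero running servers, `failover_steps` starts with the “switch on” step, while a Warm Standby environment with one running server goes straight to scaling out.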

4. AWS Disaster Recovery Multi-Site Active/Active

The Multi-Site Active/Active strategy allows you to run your workload concurrently in different Regions.

With this approach, users can access your workload in every Region where it’s deployed.

While it’s the most costly and complex approach to disaster recovery, with the right technology choices and deployment, it can lower your recovery time to almost zero for most disasters.

It’s important to distinguish this from the hot standby (active/passive) approach. In an active/passive configuration, users are directed to a single Region, and the other Regions take no traffic and are used only for disaster recovery.

Multi-site active/active, by contrast, serves traffic from every Region where it’s deployed.
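The routing difference between the two configurations can be illustrated with a short Python sketch. The function name, the mode strings, and the health-check input are assumptions for illustration; in practice this decision is made by a DNS or traffic-management service rather than by application code.

```python
def regions_serving_traffic(regions, healthy, mode, primary):
    """Return the Regions that should receive user traffic.

    regions: ordered list of Region names where the workload is deployed
    healthy: set of Regions currently passing health checks
    mode:    "active_active" or "active_passive"
    primary: the preferred Region in active/passive mode
    """
    up = [r for r in regions if r in healthy]
    if mode == "active_active":
        # Every healthy Region takes traffic all the time.
        return up
    if mode == "active_passive":
        # Only one Region takes traffic: the primary if it is healthy,
        # otherwise the first healthy standby Region.
        if primary in healthy:
            return [primary]
        return up[:1]
    raise ValueError(f"unknown mode: {mode}")
```

With both Regions healthy, active/active returns both of them and active/passive returns only the primary. When the primary fails, active/active simply drops it from the list, while active/passive has to switch users over to the standby.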

Now that you know what AWS disaster recovery is and its four main strategies, let’s study some of the best practices for it:

AWS Disaster Recovery Best Practices

If you wish to minimize the recovery time and maximize reliability in the event of a disaster, follow these best practices:

1. Pick the Right Recovery Strategy

The first step is to choose the right strategy for recovering failed workloads in AWS. Your decision will depend on your recovery needs and budget.

If the disaster you are planning for is the loss or disruption of one physical data center, and your workload is highly available and well-architected, the backup and restore approach may be all you need.

But a Multi-Site Active/Active, Warm Standby, or Pilot Light strategy is more appropriate if the disaster could extend to the loss or disruption of an entire Region, or if regulatory requirements demand it.

If your budget for AWS disaster recovery is substantial, invest in a multi-site active/active or warm standby strategy.

You can also use more than one recovery technique at the same time. For example, for a non-critical workload that can tolerate some downtime, a backup and restore approach should be suitable.

But for workloads that must be restored in the quickest possible time, you can use the multi-site active/active or warm standby method.

Try to achieve a combination that strikes the right balance between recovery performance and cost.
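As a rough illustration of matching strategies to workloads, the sketch below maps a workload’s recovery time target to one of the four strategies. The thresholds are assumptions chosen for this example, not limits defined by AWS, and the workload names are hypothetical; adjust both to your own requirements and budget.

```python
from datetime import timedelta


def suggest_strategy(rto: timedelta) -> str:
    """Suggest a DR strategy from a recovery time target.

    The cut-off values below are illustrative assumptions only.
    """
    if rto >= timedelta(hours=4):
        return "backup and restore"
    if rto >= timedelta(minutes=30):
        return "pilot light"
    if rto >= timedelta(minutes=5):
        return "warm standby"
    return "multi-site active/active"


# Hypothetical workloads and recovery time targets.
workloads = {
    "internal-wiki": timedelta(hours=24),
    "order-processing": timedelta(minutes=10),
    "payments-api": timedelta(seconds=30),
}
plan = {name: suggest_strategy(rto) for name, rto in workloads.items()}
```

Here `plan` assigns a different strategy to each workload, which is the kind of mix described above: cheaper approaches for workloads that can wait, and more expensive ones only where fast recovery is essential.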

2. Determine the RTO and RPO

RTO (recovery time objective) refers to the amount of time important systems can be down before the outage poses a serious threat to the business, while RPO (recovery point objective) refers to the amount of data loss that can be tolerated after an outage without a serious impact on the business.

Like RTO, RPO is measured in time, for example 24 hours’ worth of business data. Different businesses have different requirements for RTO and RPO.

To determine your recovery needs, list the workloads on which your business depends and then classify them based on their significance to the organization.
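Because RPO is measured in time, it can be checked against your backup schedule with simple arithmetic. The sketch below is a minimal example; the `Workload` type and the sample workloads are hypothetical.

```python
from datetime import timedelta
from typing import NamedTuple


class Workload(NamedTuple):
    name: str
    rpo: timedelta              # maximum tolerable data loss, measured in time
    backup_interval: timedelta  # how often a backup is actually taken


def rpo_violations(workloads):
    """Return the names of workloads whose backup schedule cannot meet their RPO.

    With periodic backups, a failure just before the next backup loses up to
    one full interval of data, so the interval must not exceed the RPO.
    """
    return [w.name for w in workloads if w.backup_interval > w.rpo]
```

For example, a database with a one-hour RPO that is backed up only once every 24 hours would be reported as a violation, while a reporting system with a 24-hour RPO on the same schedule would not.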

3. Develop a Backup Plan for EC2

According to AWS, you should use either EBS snapshots or AMI backups to back up virtual machines running in Amazon EC2.

The former is a snapshot of an EBS volume, containing the data stored on a volume attached to an EC2 instance. The latter is an Amazon Machine Image (AMI), which contains all the data needed to rebuild an EC2 instance that has failed.
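The two backup types map to two different EC2 API operations. The sketch below assumes you pass in a boto3 EC2 client (created with `boto3.client("ec2")`, which requires boto3 and valid credentials); the helper names and descriptions are made up for this example.

```python
def back_up_volume(ec2, volume_id: str) -> str:
    """Take an EBS snapshot of a single volume and return the snapshot ID."""
    response = ec2.create_snapshot(
        VolumeId=volume_id,
        Description=f"DR snapshot of {volume_id}",
    )
    return response["SnapshotId"]


def back_up_instance(ec2, instance_id: str, image_name: str) -> str:
    """Create an AMI from a whole instance and return the image ID."""
    response = ec2.create_image(
        InstanceId=instance_id,
        Name=image_name,
        # Skipping the reboot avoids downtime, but file-system consistency
        # of the resulting image is then not guaranteed.
        NoReboot=True,
    )
    return response["ImageId"]
```

Passing the client in as a parameter keeps the helpers easy to test with a stub and lets you reuse them with clients for different Regions.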

4. Leverage AWS Backup Automation

To make the recovery process as fast as possible, use a disaster recovery plan that leverages automation.

AWS Backup is one such tool: a centralized, policy-based way to manage backup and recovery operations for various AWS resources.

It covers many major AWS services, including EC2 instances and most databases, but not all of them. The tool backs up resources automatically according to the policies you define, and it also handles restores.
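A backup policy in AWS Backup is expressed as a backup plan. The sketch below builds a plan document of the shape accepted by the `CreateBackupPlan` operation; the plan name, rule name, schedule, and retention period are assumptions for illustration, and the vault name and ARN must refer to vaults that already exist in your account.

```python
def build_backup_plan(vault_name: str, dr_vault_arn: str, retention_days: int = 35) -> dict:
    """Build a backup plan with one daily rule and a cross-Region copy."""
    return {
        "BackupPlanName": "daily-with-cross-region-copy",  # illustrative name
        "Rules": [
            {
                "RuleName": "daily",
                "TargetBackupVaultName": vault_name,
                # Every day at 05:00 UTC.
                "ScheduleExpression": "cron(0 5 ? * * *)",
                "Lifecycle": {"DeleteAfterDays": retention_days},
                # Copy each recovery point to a vault in the recovery Region.
                "CopyActions": [
                    {
                        "DestinationBackupVaultArn": dr_vault_arn,
                        "Lifecycle": {"DeleteAfterDays": retention_days},
                    }
                ],
            }
        ],
    }
```

With boto3 installed and credentials configured, the plan could then be submitted with `boto3.client("backup").create_backup_plan(BackupPlan=build_backup_plan(...))`. Resources still need to be assigned to the plan separately before anything is backed up.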

5. Test and Maintain Every AWS Recovery Plan

The AWS recovery plans you use should be tested regularly. Simulation is the simplest way of doing that.

This can involve creating a scenario in which an important workload fails and then recovering it by executing your plan using the available backups.

Conduct the drills several times a year, and more frequently for important workloads.
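A drill is most useful when its outcome is measured against your RTO. The following sketch times a restore procedure and reports whether it finished within the target; the `restore` callable is a placeholder for whatever your own recovery runbook does.

```python
import time
from datetime import timedelta


def run_drill(restore, rto: timedelta) -> dict:
    """Run a restore procedure and compare its duration with the RTO.

    restore: a zero-argument callable that performs the (hypothetical) restore
             into a test environment and returns True on success.
    """
    started = time.monotonic()
    succeeded = bool(restore())
    elapsed = timedelta(seconds=time.monotonic() - started)
    return {
        "succeeded": succeeded,
        "elapsed": elapsed,
        "within_rto": succeeded and elapsed <= rto,
    }
```

Recording the result of every drill gives you a history that shows whether recovery times are drifting toward, or past, the RTO as the workload grows.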


By now, you should have developed a fair understanding of AWS disaster recovery and its architecture, including what it is, the AWS disaster recovery strategies out there, and some AWS disaster recovery best practices for approaching a disaster event.

We truly hope that this guide proves to be your ultimate answer to approaching AWS disaster recovery.
