Resilience on AWS

Explore how to build resilient applications on AWS

What is Resilience?

Resilience is the ability of a system to withstand and recover from failures, disruptions, or unexpected events. In the cloud, resilient software combines a range of characteristics that maintain the functionality, integrity, and robustness of applications even under the most challenging conditions. Keeping your application able to resist and recover from faults or load spikes while remaining functional is not a one-time task; it is a continuous process of design decisions, observability, and assessment. The collection of resilience resources on this page provides techniques, tutorials, and examples to help you build applications on AWS that are ready for your customers.


Four Things Everyone Should Know About Resilience
Discover essential concepts for building resilient applications in the cloud on AWS, along with plenty of links to enable you to go deeper on these topics.

What is resilience - It is about resisting faults and load spikes and remaining up.

How to prevent faults from becoming failures - Faults happen all the time. How do you prevent them from becoming failures in your application in the cloud?

How to think about resilience - We like to think about it as a three-part model. This helps you understand the different strategies used to mitigate different types of faults.

How does the cloud help you build resilient applications - Learn about the tools and automation offered by the cloud to implement resilience best practices.

Learn more

Resilience Foundations

Building resilient applications on AWS involves a holistic approach, integrating key resources and frameworks. Starting with the "Reliability Pillar of the AWS Well-Architected Framework," builders can learn the best practices for creating resilient cloud workloads. The "Resilience Analysis Framework" deepens this understanding by highlighting crucial failure modes and the trade-offs involved in implementing mitigations. To maintain and enhance resilience, the "Resilience Lifecycle Framework" presents a continuous improvement strategy across five stages. Finally, AWS Resilience Hub helps developers assess and refine their applications' resilience, using AWS best practices and automated solutions. Together, these resources provide a comprehensive path to achieving and sustaining resilient applications on AWS.

Reliability Pillar of the AWS Well-Architected Framework

The AWS Well-Architected Framework helps you understand the pros and cons of decisions you make while building workloads on AWS. This pillar whitepaper documents the best practices you need to build resilient applications on the cloud.

Learn more

Resilience Analysis Framework

Use resilience analysis to understand which failure modes are most important to protect your application against. This whitepaper introduces the SEEMS model, covering five common failure categories, with each letter in SEEMS standing for one of these failure modes: single points of failure, excessive load, excessive latency, misconfigurations and bugs, and shared fate.
Learn more

Resilience Lifecycle Framework: A continuous approach to resilience improvement

A continuous lifecycle enables you to always improve the resilience of your application. Based on years of working with customers and internal teams, this framework outlines five key stages and the activities in each to keep your application resilient.

Learn more

AWS re:Invent 2023 - Resilience lifecycle: A mental model for resilience on AWS (48 mins)

Building Resilient Well-Architected Workloads Using AWS Resilience Hub

The AWS Resilience Hub offers several capabilities to improve the resilience of your applications on AWS. It assesses your application against AWS Well-Architected best practices for resilience and gives specific guidance on how to improve your resilience posture. It also gives you templates to easily deploy new CloudWatch alarms, Fault Injection Service experiments, and automated runbooks in Systems Manager. With these, you can monitor and test your resilience, as well as automate actions that are part of a resilience strategy.

Workshop

AWS Resilience Hub Workshop

The goal of this workshop is to walk through the various functionalities of AWS Resilience Hub. By the end of the workshop you should have an understanding of the different service components and how to use the service to assess your workload resiliency.
Learn more

Blog

Building Resilient Well-Architected Workloads Using AWS Resilience Hub

AWS Resilience Hub is a new service that helps you understand and improve the resiliency of your workloads using AWS Well-Architected best practices.
Learn more


Learn

High Availability (HA)

High Availability (HA) takes a proactive approach to resilience. It's about designing your systems in such a way that they can automatically recover from common failures without human intervention. This could mean duplicating critical components, balancing loads across multiple servers, or using cloud services that can reroute traffic in the event of a network blip. High Availability (HA) is all about reducing the probability of a significant impact on your services due to small, frequent issues.
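One common way to recover from small, frequent faults without human intervention is to retry a failed call with capped exponential backoff and jitter. The following is a minimal Python sketch of that idea, not an AWS API: `backoff_delays`, `call_with_retries`, and all parameter defaults are hypothetical names and values chosen for illustration.

```python
import random
import time


def backoff_delays(max_attempts, base=0.1, cap=5.0, rng=random.random):
    """Yield one 'full jitter' delay per retry: uniform in [0, min(cap, base * 2**n))."""
    for attempt in range(max_attempts - 1):
        yield rng() * min(cap, base * (2 ** attempt))


def call_with_retries(operation, max_attempts=4, base=0.1, cap=5.0, sleep=time.sleep):
    """Call operation(); on an exception, wait a jittered delay and try again.

    Re-raises the last exception once max_attempts calls have failed.
    """
    delays = backoff_delays(max_attempts, base, cap)
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # retry budget exhausted: surface the failure
            sleep(next(delays))
```

The cap bounds the longest wait, and the random jitter spreads retries from many clients over time so they do not all hit a recovering dependency at the same moment. In practice you would usually retry only on errors known to be transient rather than on every exception.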

Availability and Beyond: Understanding and Improving the Resilience of Distributed Systems on AWS

This paper outlines a common understanding of availability as a measure of resilience, establishes rules for building highly available workloads, and offers guidance on how to improve workload availability.
Learn more

Dive into Best Practices

Go deeper to understand how to implement specific resilience best practices to achieve your availability goals.

Blogs

How to Setup Replication Lag Monitoring for Amazon DynamoDB global tables
Levelling up Your Releases: Reduce Risk with Blue/Green Deployments

Extra resources

Using load shedding to avoid overload
Timeouts, retries, and backoff with jitter
Static stability using Availability Zones

Docs

AWS Fault Isolation Boundaries

AWS re:Invent 2023 - 5 things you should know about resilience at scale (ARC327, 59 mins)

Cell-Based Architecture

Reducing the Scope of Impact with Cell-Based Architecture

The purpose of this guidance is to introduce a resilience analysis framework that provides a consistent way to analyze failure modes and how they could impact your workloads.

Learn more

AWS re:Invent 2023 - Reducing your area of impact and surviving difficult days (51 mins)

Health Checks

Implementing health checks
Choosing the right health check with Elastic Load Balancing and EC2 Auto Scaling
How to Build and Manage a Resilient Service Using Health Checks, Decoupled Dependencies, and Load Balancing Using AWS SDKs

Gray Failures

AWS re:Invent 2023 - Detecting and mitigating gray failures (55 mins)

Docs

Resilience Analysis Framework
Advanced Multi-AZ Resilience Patterns - Detecting and Mitigating Gray Failures

Blogs

What Happened to My Car? Understanding Gray Failures
Fix Gray Failures Fast Using Automation and Route 53 ARC Zonal Shift

AWS re:Invent 2023 - Using zonal autoshift to automatically recover from an AZ impairment (56 mins)

Workshop

Advanced Multi-AZ Resilience Patterns

Extra resources

Detecting gray failures with outlier detection in Amazon CloudWatch Contributor Insights
Rapidly recover from application failures in a single AZ


Cell-Based Architecture

Reducing the Scope of Impact with Cell-Based Architecture

ARC306 | Reducing your area of impact and surviving difficult days

In this breakout session, you’ll learn about cell-based architectures and sharding. These are two ways you can structure your AWS resources (like compute, storage, and network) to improve resilience. These advanced techniques give you control over the fault isolation boundaries in your architecture, constraining faults to a small number of resources while the rest continue to serve requests from your customers.
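To make the fault isolation idea concrete, here is a minimal Python sketch of a routing layer that deterministically assigns each customer to exactly one cell, so a fault in one cell only affects the customers mapped to it. The cell names and the functions `cell_for` and `affected_customers` are hypothetical, and simple hash-modulo assignment is an assumption made for brevity; production designs often use a mapping table or consistent hashing so customers can be moved between cells.

```python
import hashlib

CELLS = ["cell-1", "cell-2", "cell-3", "cell-4"]  # hypothetical cell identifiers


def cell_for(customer_id, cells=CELLS):
    """Deterministically map a customer ID to one cell using a stable hash."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return cells[int.from_bytes(digest[:8], "big") % len(cells)]


def affected_customers(customer_ids, failed_cell, cells=CELLS):
    """Return the customers whose requests would land in the failed cell."""
    return [c for c in customer_ids if cell_for(c, cells) == failed_cell]
```

Because every customer maps to a single cell, the sets returned by `affected_customers` for different cells never overlap, which is what bounds the scope of impact of a single-cell failure.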


Gray Failures

Binary failure events are typified by a resource either working or not working. Detection and mitigation of these can be straightforward. However, gray failures are a different story. In this case, the system may sometimes fail and sometimes not. Manifestations of this type of failure can be subtle and defy quick and definitive detection. The following resources can help you detect and mitigate them:

Advanced Multi-AZ Resilience Patterns - Detecting and Mitigating Gray Failures
What Happened to My Car? Understanding Gray Failures
Fix Gray Failures Fast Using Automation and Route 53 ARC Zonal Shift
Detecting gray failures with outlier detection in Amazon CloudWatch Contributor Insights
Rapidly recover from application failures in a single AZ
Advanced Multi-AZ Resilience Patterns
ARC310 | Detecting and mitigating gray failures
COP343 | Building Observability to increase resiliency
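As a rough illustration of why gray failures call for comparison rather than a simple up/down check, the Python sketch below flags an Availability Zone whose error rate stands out from its peers over the same time window. It is not the method used by CloudWatch Contributor Insights or any AWS service; the function `outlier_zones`, the zone names, and the `min_rate` and `factor` thresholds are assumptions made for this example.

```python
from statistics import median


def outlier_zones(requests, errors, min_rate=0.01, factor=3.0):
    """Flag zones whose error rate is well above the median of the other zones.

    requests, errors: dicts mapping zone name -> counts for the same time window.
    min_rate: ignore zones below this absolute error rate (filters out noise).
    factor: multiple of the peers' median rate that counts as an outlier.
    """
    rates = {z: (errors.get(z, 0) / n if n else 0.0) for z, n in requests.items()}
    flagged = []
    for zone, rate in rates.items():
        peers = [r for z, r in rates.items() if z != zone]
        if not peers:
            continue  # a single zone has nothing to be compared against
        if rate >= min_rate and rate > factor * median(peers):
            flagged.append(zone)
    return flagged
```

A zone is flagged only when it is both unhealthy in absolute terms and clearly worse than the others. If every zone degrades equally, nothing is flagged, which suggests the problem is not zonal and a zonal shift would not help.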

ARC309 | Build applications that recover from an Availability Zone impairment

This session pairs well with ARC301. In this breakout, you’ll learn about Amazon Route 53 Application Recovery Controller zonal shift. OK, that service name is a mouthful, but what it does is powerful: it gives you control over which AZs are in or out for your application, that is, which ones are receiving traffic. Using the monitoring techniques covered in this session, you’ll be able to detect when an AZ needs to be taken out of service, learn how to take it out, and keep healthy AZs online to serve your customer traffic.

Disaster Recovery (DR)

Disaster Recovery (DR) is your safety net. It's the process and policies you put in place to recover from catastrophic events that can cause extended outages, such as natural disasters, cyberattacks, or significant technical failures. The goal here is to minimize downtime and data loss by having a robust backup and restore strategy. This involves not just backing up data but ensuring you can quickly restore operations, possibly in a different geographic location if necessary.

Learn more

Extra resources

DR Series
Creating a Multi-Region Application with AWS Services series
Creating DR Mechanisms Using Amazon Route 53
Minimizing Dependencies in a DR Plan
Building highly resilient applications using Amazon Route 53 Application Recovery Controller, Part 2: Multi-Region stack
Automate DR validation with AWS Backup
Testing Backup and Restore of Data

Docs

Disaster Recovery (DR) of Workloads on AWS: Recovery in the Cloud
Disaster Recovery (DR) of On-Premises Applications to AWS
AWS Multi-Region Fundamentals

Workshop

DR with Amazon Route 53 Application Recovery Controller (ARC)
Plan for DR Workshops

Continuous Improvement

Resilience is not a set-it-and-forget-it feature. It requires a commitment to continuous improvement. This means regularly testing your systems' ability to recover from failures, a practice known as chaos engineering. It also involves monitoring your systems in real time to quickly identify and address issues. By making resilience testing a part of your continuous deployment pipeline, you ensure that your architecture can adapt to new challenges and remain robust against unforeseen threats.
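The kind of experiment that can run in a deployment pipeline can be sketched in a few lines of Python: inject faults into a dependency, then check that a steady-state hypothesis still holds. This is an in-process illustration only and does not use AWS Fault Injection Service; `FaultInjector`, `get_price`, `run_experiment`, and the chosen failure rate are all hypothetical.

```python
import random


class FaultInjector:
    """Wrap a dependency call and raise an injected error for a fraction of calls."""

    def __init__(self, func, failure_rate, rng=None):
        self.func = func
        self.failure_rate = failure_rate
        self.rng = rng or random.Random()

    def __call__(self, *args, **kwargs):
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected fault")
        return self.func(*args, **kwargs)


def get_price(fetch, item, fallback=None):
    """Hypothetical application code under test: fall back when the dependency fails."""
    try:
        return fetch(item)
    except ConnectionError:
        return fallback


def run_experiment(trials=1000, failure_rate=0.3, seed=42):
    """Steady-state hypothesis: every request gets an answer despite injected faults.

    Returns (hypothesis_held, number_of_requests_served_by_the_fallback).
    """
    flaky_fetch = FaultInjector(lambda item: 9.99, failure_rate, random.Random(seed))
    results = [get_price(flaky_fetch, "book", fallback=0.0) for _ in range(trials)]
    return all(r is not None for r in results), results.count(0.0)
```

A pipeline stage could call `run_experiment()` and fail the build when the first returned value is false, which turns the resilience check into a regression test rather than a one-off exercise.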

Learn more

Blogs

Choosing The Right Chaos Engineering Tool for the Job
Chaos Engineering in under 2 minutes
Automating Chaos Engineering in Your Delivery Pipelines
Engineering Resilience: Lessons from Amazon Search's Chaos Engineering Journey
Any Day Can Be Prime Day: How Amazon.com Search Uses Chaos Engineering to Handle Over 84K Requests Per Second

Extra resources

Chaos Engineering in the cloud
Towards continuous resilience

AWS re:Invent 2023 - Practice like you play: How Amazon scales resilience to new heights (54 mins)

Workshop

Chaos Engineering on AWS Workshop
