Resilience on AWS
Explore how to build resilient applications on AWS
What is Resilience?
Resilience is the ability of a system to withstand and recover from failures, disruptions, or unexpected events. Resilient software in the cloud combines a range of characteristics that maintain the functionality, integrity, and robustness of applications even under the most challenging conditions. For example, keeping your application able to resist and recover from faults or load spikes while remaining functional is a continuous process of design decisions, observability, and assessment. The collection of resilience resources on this page provides techniques, tutorials, and examples to help you build applications in your AWS cloud that are ready and primed for customers.
Four Things Everyone Should Know About Resilience
Discover essential concepts for building resilient applications in the cloud on AWS, along with plenty of links to enable you to go deeper on these topics.
- What is resilience - It is about resisting faults and load spikes and remaining up.
- How to prevent faults from becoming failures - Faults happen all the time. How do you prevent them from becoming failures in your application in the cloud?
- How to think about resilience - We like to think about it as a three-part model. This helps you understand the different strategies used to mitigate different types of faults.
- How does the cloud help you build resilient applications - Learn about the tools and automation offered by the cloud to implement resilience best practices.
Resilience Foundations
Building resilient applications on AWS involves a holistic approach, integrating key resources and frameworks. Starting with the "Reliability Pillar of the AWS Well-Architected Framework," builders can learn the best practices for creating resilient cloud workloads. The "Resilience Analysis Framework" further deepens this understanding by highlighting crucial failure modes and the trade-offs involved in implementing mitigations. To maintain and enhance resilience, the "Resilience Lifecycle Framework" presents a continuous improvement strategy across five stages. Finally, the AWS Resilience Hub empowers developers to assess and refine their applications' resilience, leveraging AWS's best practices and automated solutions for a robust resilience posture. Together, these resources provide a comprehensive path to achieving and sustaining resilient applications on AWS.
Reliability Pillar of the AWS Well-Architected Framework
The AWS Well-Architected Framework helps you understand the pros and cons of decisions you make while building workloads on AWS. This pillar whitepaper documents the best practices you need to build resilient applications on the cloud.
Resilience Analysis Framework
Use resilience analysis to understand which failure modes are most important to protect your application against. This whitepaper introduces the SEEMS model, which covers five common failure categories. Each letter in SEEMS stands for one of these failure modes: single points of failure, excessive load, excessive latency, misconfigurations and bugs, and shared fate.
Resilience Lifecycle Framework: A continuous approach to resilience improvement
A continuous lifecycle enables you to always improve the resilience of your application. Based on years of working with customers and internal teams, this framework outlines five key stages and the activities in each to keep your application resilient.
Building Resilient Well-Architected Workloads Using AWS Resilience Hub
The AWS Resilience Hub offers several capabilities to improve the resilience of your applications on AWS. It assesses your application against AWS Well-Architected best practices for resilience and gives specific guidance on how to improve your resilience posture. It also gives you templates to easily deploy new CloudWatch alarms, Fault Injection Service experiments, and automated runbooks in Systems Manager. With these, you can monitor and test your resilience, as well as automate actions that are part of a resilience strategy.
Workshop: AWS Resilience Hub Workshop
The goal of this workshop is to walk through the various functionalities of AWS Resilience Hub. By the end of the workshop you should have an understanding of the different service components and how to use the service to assess your workload resiliency.
Blog: Building Resilient Well-Architected Workloads Using AWS Resilience Hub
AWS Resilience Hub is a new service that helps you understand and improve the resiliency of your workloads using AWS Well-Architected best practices.
High Availability (HA)
High Availability (HA) takes a proactive approach to resilience. It's about designing your systems in such a way that they can automatically recover from common failures without human intervention. This could mean duplicating critical components, balancing loads across multiple servers, or using cloud services that can reroute traffic in the event of a network blip. High Availability (HA) is all about reducing the probability of a significant impact on your services due to small, frequent issues.
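As a rough illustration of recovering automatically by duplicating a critical component, the sketch below tries each redundant replica in turn and returns the first successful answer. The names `call_with_failover`, `broken_replica`, and `healthy_replica` are hypothetical and exist only for this example; a real workload would more likely rely on a load balancer or managed service for this behavior.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when every redundant replica has failed."""


def call_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each replica in order and return the first successful result."""
    errors = []
    for replica in replicas:
        try:
            return replica()
        except Exception as exc:  # a fault in one replica, absorbed here
            errors.append(exc)
    # Only when every replica faults does the caller see a failure.
    raise AllReplicasFailed(f"{len(errors)} replicas failed")


def broken_replica() -> str:
    # Hypothetical replica that is currently unavailable.
    raise ConnectionError("replica unavailable")


def healthy_replica() -> str:
    # Hypothetical replica that answers normally.
    return "ok"


result = call_with_failover([broken_replica, healthy_replica])
```

Here the fault in the first replica never reaches the caller, because the second replica answers without human intervention.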
Availability and Beyond: Understanding and Improving the Resilience of Distributed Systems on AWS
This paper outlines a common understanding of availability as a measure of resilience, establishes rules for building highly available workloads, and offers guidance on how to improve workload availability.
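To make "availability as a measure" concrete, here is a minimal sketch using standard reliability formulas: availability from mean time between failures (MTBF) and mean time to repair (MTTR), the combined availability of components that must all work, and the availability of independent redundant copies. The function names and any numbers passed to them are illustrative assumptions, not figures taken from the paper.

```python
import math
from typing import Iterable


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a component is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def serial_availability(components: Iterable[float]) -> float:
    """All components must be up, so their availabilities multiply."""
    return math.prod(components)


def redundant_availability(single: float, copies: int) -> float:
    """At least one of `copies` independent replicas must be up."""
    return 1.0 - (1.0 - single) ** copies


def annual_downtime_minutes(avail: float) -> float:
    """Expected downtime per 365-day year for a given availability."""
    return (1.0 - avail) * 365 * 24 * 60
```

For instance, two hard dependencies at 99% each yield roughly 98% combined, while two independent redundant copies at 99% each yield roughly 99.99%. The redundancy formula assumes the copies fail independently, which shared fate can undermine.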
Dive into Best Practices
Go deeper to understand how to implement specific resilience best practices to achieve your availability goals.
Cell-Based Architecture
Reducing the Scope of Impact with Cell-Based Architecture
The purpose of this guidance is to introduce a resilience analysis framework that provides a consistent way to analyze failure modes and how they could impact your workloads.
Health Checks
Gray Failures
ARC309 | Build applications that recover from an Availability Zone impairment (video): This session pairs well with ARC301. In this breakout, you'll learn about Amazon Route 53 Application Recovery Controller zonal shift. The service name is a mouthful, but what it does is powerful: it gives you control over which AZs are in or out for your application (that is, which ones are receiving traffic). Using the monitoring techniques covered in this session, you'll be able to detect when an AZ needs to be taken out of service, learn how to take it out, and keep healthy AZs online to serve your customer traffic.
Disaster Recovery (DR)
Disaster Recovery (DR) is your safety net. It's the process and policies you put in place to recover from catastrophic events that can cause extended outages, such as natural disasters, cyberattacks, or significant technical failures. The goal here is to minimize downtime and data loss by having a robust backup and restore strategy. This involves not just backing up data but ensuring you can quickly restore operations, possibly in a different geographic location if necessary.
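One way to reason about "minimize downtime and data loss" is to compare a recovery plan against explicit targets. The sketch below assumes two targets, commonly called the recovery point objective (how much data loss is tolerable) and the recovery time objective (how much downtime is tolerable), and assumes that worst-case data loss is roughly one backup interval. All class, field, and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass(frozen=True)
class RecoveryObjectives:
    max_data_loss: timedelta  # often called the recovery point objective (RPO)
    max_downtime: timedelta   # often called the recovery time objective (RTO)


@dataclass(frozen=True)
class RecoveryPlan:
    backup_interval: timedelta        # time between successive backups
    measured_restore_time: timedelta  # duration observed in the last restore test


def find_gaps(plan: RecoveryPlan, objectives: RecoveryObjectives) -> List[str]:
    """Return a description of each objective the plan fails to meet."""
    gaps = []
    # Assumption: data written since the last backup is lost in a disaster.
    if plan.backup_interval > objectives.max_data_loss:
        gaps.append("backups are too infrequent to meet the data-loss target")
    if plan.measured_restore_time > objectives.max_downtime:
        gaps.append("restore takes longer than the downtime target")
    return gaps


objectives = RecoveryObjectives(
    max_data_loss=timedelta(hours=1),
    max_downtime=timedelta(hours=4),
)
plan = RecoveryPlan(
    backup_interval=timedelta(hours=24),
    measured_restore_time=timedelta(hours=2),
)
gaps = find_gaps(plan, objectives)
```

In this illustrative case the restore is fast enough, but daily backups cannot satisfy a one-hour data-loss target, so `gaps` contains a single entry.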
Extra resources
- DR Series
- Creating a Multi-Region Application with AWS Services series
- Creating DR Mechanisms Using Amazon Route 53
- Minimizing Dependencies in a DR Plan
- Building highly resilient applications using Amazon Route 53 Application Recovery Controller, Part 2: Multi-Region stack
- Automate DR validation with AWS Backup
- Testing Backup and Restore of Data
Continuous Improvement
Resilience is not a set-it-and-forget-it feature. It requires a commitment to continuous improvement. This means regularly testing your systems' ability to recover from failures, a practice known as chaos engineering. It also involves monitoring your systems in real time to quickly identify and address issues. By making resilience testing a part of your continuous deployment pipeline, you ensure that your architecture can adapt to new challenges and remain robust against unforeseen threats.
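The idea of testing recovery by deliberately introducing failures can be sketched in a few lines. The example below is a toy, in-process stand-in for a chaos experiment: it wraps a function so that it fails with a chosen probability, then measures how often a retrying caller still succeeds. Every name here (`inject_faults`, `with_retries`, `lookup_price`, `run_experiment`) is hypothetical, and a real experiment would use a dedicated tool such as AWS Fault Injection Service rather than this wrapper.

```python
import random
from typing import Any, Callable


def inject_faults(func: Callable[..., Any], failure_rate: float,
                  rng: random.Random) -> Callable[..., Any]:
    """Wrap func so each call raises ConnectionError with the given probability."""
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        if rng.random() < failure_rate:
            raise ConnectionError("injected fault")
        return func(*args, **kwargs)
    return wrapper


def with_retries(func: Callable[..., Any], attempts: int) -> Callable[..., Any]:
    """Wrap func so a ConnectionError is retried up to `attempts` calls in total."""
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        for attempt in range(attempts):
            try:
                return func(*args, **kwargs)
            except ConnectionError:
                if attempt == attempts - 1:
                    raise  # out of attempts: the fault becomes a failure
    return wrapper


def lookup_price(item: str) -> int:
    """Hypothetical dependency under test."""
    return {"book": 12}[item]


def run_experiment(trials: int, failure_rate: float, attempts: int, seed: int) -> float:
    """Return the fraction of trials that succeed despite injected faults."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    resilient = with_retries(inject_faults(lookup_price, failure_rate, rng), attempts)
    successes = 0
    for _ in range(trials):
        try:
            resilient("book")
            successes += 1
        except ConnectionError:
            pass
    return successes / trials
```

Running `run_experiment` with one attempt and then with several, at the same failure rate, gives a simple measure of how much the retry mitigation helps. A check like this could run as a step in a delivery pipeline.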
Blogs
- Choosing The Right Chaos Engineering Tool for the Job
- Chaos Engineering in under 2 minutes
- Automating Chaos Engineering in Your Delivery Pipelines
- Engineering Resilience: Lessons from Amazon Search's Chaos Engineering Journey
- Any Day Can Be Prime Day: How Amazon.com Search Uses Chaos Engineering to Handle Over 84K Requests Per Second
Extra resources
- Workshop