In May 2017, British Airways grounded all of its flights at two of London’s busiest airports. An IT failure affected 75,000 British Airways passengers that day. The airline’s investigation attributed the incident to poor resilience and a lack of proper disaster recovery after a power surge at a UK-based data center.
The CEO of British Airways later explained that this one IT failure cost the company 80 million pounds!
Failures are bound to happen in your company too.
“Many unknown factors can not be anticipated in the regular testing scenario for application failure. So this raises the ultimate question – Is regular testing sufficient?
What if your system was rebooted by mistake? Or your access management was compromised?
These kinds of problems cannot be calculated and solved in regular tests.”
– Suratip Banerjee, Solutions Architect at Principal Global Services, at a webinar with Whizlabs on Chaos Engineering and AWS Fault Injection Simulator.
These failures cause costly outages for companies. The outages hurt customers trying to shop, transact business, and get work done. Even brief outages can impact a company’s bottom line, so the cost of downtime is becoming a KPI for many engineering teams.
In 2017, 98% of organizations said that a single hour of downtime would cost their business almost a million dollars. That is a huge risk.
Companies need a way to meet this challenge head-on, and waiting for the next outage is not an option.
This is where Chaos Engineering comes in.
On 12 September 2021, Whizlabs hosted a webinar on Chaos Engineering and AWS FIS.
Suratip Banerjee, who was the featured speaker, explained in detail all the aspects of Chaos Engineering, AWS FIS, and their benefits.
What Is Chaos Engineering?
Chaos engineering is the process of deliberately stressing an application in a test or production environment.
It works by creating disruptive events such as a server outage, API throttling, or added latency, and then observing how the system responds. The goal is to prove or disprove our assumptions about the system’s ability to handle these disruptions, and then to implement improvements based on what we learn.
These experiments have the added benefit of helping teams build muscle memory in resolving outages, akin to a fire drill. By breaking things on purpose we surface unknown issues that could impact our systems and customers.
Instead of letting these events happen at 3 am or on weekends, we create them in a controlled environment during working hours when all our teams and engineers are ready to tackle the issue.
Benefits of Chaos Engineering
- Customer: Increased availability and durability of the service mean fewer outages disrupting customers’ day-to-day lives.
- Business: Chaos Engineering can help prevent losses in revenue and maintenance costs, create happier and more engaged engineers, improve on-call training for engineering teams, and improve the SEV (incident) Management Program for the entire company.
- Technical: The insights from chaos experiments can mean a reduction in incidents, a lighter on-call burden, a better understanding of system failure modes, improved system design, faster mean time to detection for SEVs, and fewer repeated SEVs. Experiments also expose blind spots in monitoring, observability, and alarms, and help improve recovery time and operational skills.
After adopting chaos engineering, 47% of businesses reported increased availability, and 45% reported reduced mean time to resolution (MTTR)!
Principles of Chaos Engineering
The key to performing Chaos Engineering is to understand its principles and then follow a well-planned process based on these principles.
The first step is to define the steady state, which refers to the performance of the system under normal conditions.
Start by looking for measurable outputs that link operating metrics to customer experience. In the steady state, the observed behavior of the system should be predictable, and it should vary noticeably when a failure is introduced.
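As a rough illustration of what a steady-state check could look like, here is a minimal Python sketch. The metric (orders per minute), the baseline, and the 5% tolerance are all made-up assumptions for the example; in practice you would pick a metric and a threshold that reflect your own customers’ experience.

```python
from statistics import mean


def within_steady_state(samples, baseline, tolerance):
    """Return True if the average of `samples` is within `tolerance`
    (a fraction, e.g. 0.05 for 5%) of the expected `baseline`."""
    observed = mean(samples)
    return abs(observed - baseline) <= tolerance * baseline


# Hypothetical business metric: orders per minute.
baseline_orders_per_min = 200.0
normal_samples = [198, 203, 201, 199]   # ordinary traffic
fault_samples = [120, 95, 110, 101]     # readings while a fault is active

print(within_steady_state(normal_samples, baseline_orders_per_min, 0.05))
print(within_steady_state(fault_samples, baseline_orders_per_min, 0.05))
```

The first call stays inside the tolerance band and the second does not, which is the kind of clear difference you want a steady-state metric to show.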
Next, form a hypothesis by asking “what if” questions such as:
“What if this Load Balancer breaks?”
“What if this database stops?”
“What if latency increases by 300ms?”
Brainstorm with your team and pick the one scenario that is most likely to happen or that should be prioritized. The hypothesis must not be too complicated, and it should target a part of the system that you believe to be resilient.
Design The Experiment
The best ways to begin the experiment phase are:
- Start small
- Keep the environment as close to production as possible
- Minimize the blast radius
- Have an emergency STOP!
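The guidelines above can be sketched in a few lines of Python. This is a toy, in-memory illustration, not a real chaos tool: `run_experiment`, `inject`, `revert`, and `healthy` are hypothetical names, and the "hosts" are just strings. It limits the blast radius to a fraction of the hosts, checks a health signal on every step, and always reverts the fault, even when the emergency stop fires.

```python
import random


def run_experiment(hosts, inject, revert, healthy,
                   blast_fraction=0.1, max_steps=5, seed=None):
    """Inject a fault into a small random subset of hosts and stop
    early if the health check fails. The fault is always reverted."""
    rng = random.Random(seed)
    # Blast radius: at least one host, at most a fraction of the fleet.
    count = max(1, int(len(hosts) * blast_fraction))
    victims = rng.sample(hosts, count)
    for host in victims:
        inject(host)
    try:
        for step in range(max_steps):
            if not healthy():  # emergency STOP condition
                return {"victims": victims, "aborted": True, "steps": step}
        return {"victims": victims, "aborted": False, "steps": max_steps}
    finally:
        for host in victims:  # roll back whether or not we aborted
            revert(host)


# Toy usage: "down" tracks which hosts currently have the fault applied.
down = set()
hosts = [f"host-{i}" for i in range(10)]
result = run_experiment(hosts, inject=down.add, revert=down.discard,
                        healthy=lambda: len(down) < 2, seed=42)
print(result["aborted"], len(result["victims"]), len(down))
```

Starting with a 10% blast radius keeps the health check satisfied here; raising `blast_fraction` to 0.5 in this toy setup would trip the stop condition immediately.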
Verify and Learn
At this stage, you analyze the results of the experiment. You can evaluate your report against these metrics:
- Time to detect
- Time for notification and escalation
- Time to public notification
- Time for graceful degradation to kick in
- Time for self-healing to happen
- Time for recovery – Partial and full
- Time to all clear and stable
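One simple way to get these numbers is to record a timestamp for each milestone during the experiment and compute the gaps afterwards. The sketch below assumes a hand-recorded timeline; the event names and times are invented for illustration, and only a few of the metrics from the list are shown.

```python
from datetime import datetime


def minutes_between(timeline, start, end):
    """Minutes elapsed between two named events in the timeline."""
    return (timeline[end] - timeline[start]).total_seconds() / 60


# Hypothetical timeline recorded during one experiment.
timeline = {
    "fault_injected": datetime(2021, 9, 12, 10, 0),
    "detected": datetime(2021, 9, 12, 10, 4),
    "team_notified": datetime(2021, 9, 12, 10, 6),
    "partial_recovery": datetime(2021, 9, 12, 10, 15),
    "full_recovery": datetime(2021, 9, 12, 10, 25),
    "all_clear": datetime(2021, 9, 12, 10, 40),
}

# Every metric is measured from the moment the fault was injected.
report = {
    "time_to_detect": minutes_between(timeline, "fault_injected", "detected"),
    "time_to_notify": minutes_between(timeline, "fault_injected", "team_notified"),
    "time_to_partial_recovery": minutes_between(timeline, "fault_injected", "partial_recovery"),
    "time_to_full_recovery": minutes_between(timeline, "fault_injected", "full_recovery"),
    "time_to_all_clear": minutes_between(timeline, "fault_injected", "all_clear"),
}

for name, minutes in report.items():
    print(f"{name}: {minutes:.0f} min")
```

Comparing these numbers across repeated runs of the same experiment shows whether your fixes are actually shortening detection and recovery.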
Fixing and improving is the concluding stage of your chaos engineering experiment. Here you fix the failures that the system faced during the experiment and learn from them.
Fault injection experiments are a fundamental part of chaos engineering.
A fault injection simulator simplifies the process by creating the real-world conditions required to uncover possible failures in an application.
AWS Fault Injection Simulator (AWS FIS) is a fully managed service that helps you improve an application’s performance, observability, and resilience by running fault injection experiments on AWS.
Each AWS FIS experiment targets a specific set of AWS resources and performs a set of actions on them.
Components of AWS FIS
Actions are the fault injection activities executed during an experiment. For each action, you specify:
- Fault type
- Targeted resources
- Timing relative to any other actions
- Fault-specific parameters, such as rollback behavior or the portion of requests to throttle
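To make the list above more concrete, here is a sketch of how two actions might look when written as Python dictionaries in the shape used by FIS experiment templates. The action names (`stop-web`, `reboot-workers`) and target names (`web-servers`, `worker-servers`) are hypothetical labels chosen for this example; the action IDs and field names reflect our understanding of the service and should be checked against the current AWS FIS documentation.

```python
import json

# Hypothetical action names; they are just labels inside one template.
actions = {
    "stop-web": {
        "actionId": "aws:ec2:stop-instances",             # fault type
        "description": "Stop one web server for five minutes",
        "parameters": {
            # Fault-specific parameter: restart the instances after
            # an ISO 8601 duration (here, five minutes).
            "startInstancesAfterDuration": "PT5M",
        },
        "targets": {"Instances": "web-servers"},          # targeted resources
    },
    "reboot-workers": {
        "actionId": "aws:ec2:reboot-instances",
        "description": "Reboot the workers once the web fault has finished",
        "targets": {"Instances": "worker-servers"},
        "startAfter": ["stop-web"],                       # timing relative to other actions
    },
}

print(json.dumps(actions, indent=2))
```

Actions without `startAfter` begin as soon as the experiment starts, so ordering only needs to be spelled out where one fault should follow another.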
A target defines one or more resources on which an Action is to be carried out. You define targets while creating an experiment template.
When you define a target, you specify the following:
- The resource type
- Resource IDs, tags, and filters
- Selection mode (e.g. ALL, or a randomly chosen COUNT(n) or PERCENT(n) of the matching resources)
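A target could be sketched as a Python dictionary like the one below. The tag (`env: staging`) and the filter are assumptions made for this example, and `is_valid_selection_mode` is a small hypothetical helper, not part of any AWS SDK. It only checks the selection mode formats as we understand them: `ALL`, `COUNT(n)`, and `PERCENT(n)`.

```python
import re

# Hypothetical target: one running EC2 instance tagged env=staging.
target = {
    "resourceType": "aws:ec2:instance",
    "resourceTags": {"env": "staging"},
    "filters": [{"path": "State.Name", "values": ["running"]}],
    "selectionMode": "COUNT(1)",  # pick one matching instance at random
}

# Accepts ALL, COUNT(n) with n >= 1, and PERCENT(n) with n from 1 to 100.
_SELECTION_MODE = re.compile(r"ALL|COUNT\([1-9]\d*\)|PERCENT\((100|[1-9]\d?)\)")


def is_valid_selection_mode(mode):
    """Local sanity check for the selection mode string (illustrative only)."""
    return _SELECTION_MODE.fullmatch(mode) is not None


print(is_valid_selection_mode(target["selectionMode"]))
print(is_valid_selection_mode("RANDOM"))
```

Using `COUNT(1)` rather than `ALL` is an easy way to keep the blast radius small while you are still learning how the system reacts.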
An experiment template is a blueprint of your experiment. It contains the actions, targets, and stop conditions for that experiment. After you create an experiment template, you can use it to run an experiment.
Experiment templates include:
- Stop condition alarms
- IAM role
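Putting the pieces together, a complete template might be assembled as shown in this sketch. The account ID, role ARN, alarm ARN, and all names are placeholders invented for the example. The dictionary follows the shape we would expect to pass to the `create_experiment_template` call of the boto3 `fis` client, but the code below deliberately stays offline and only runs a local consistency check with a hypothetical helper, `unresolved_targets`.

```python
import json

# All ARNs and names below are placeholders, not real resources.
template = {
    "description": "Stop one staging web server and watch the error rate",
    "targets": {
        "web-servers": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"env": "staging"},
            "selectionMode": "COUNT(1)",
        },
    },
    "actions": {
        "stop-web": {
            "actionId": "aws:ec2:stop-instances",
            "parameters": {"startInstancesAfterDuration": "PT5M"},
            "targets": {"Instances": "web-servers"},
        },
    },
    # Stop condition: a CloudWatch alarm that halts the experiment.
    "stopConditions": [
        {
            "source": "aws:cloudwatch:alarm",
            "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:high-error-rate",
        },
    ],
    # IAM role that FIS assumes to act on the targeted resources.
    "roleArn": "arn:aws:iam::123456789012:role/fis-demo-role",
}


def unresolved_targets(tpl):
    """Return target names that actions refer to but the template never defines."""
    defined = set(tpl["targets"])
    referenced = {
        name
        for action in tpl["actions"].values()
        for name in action.get("targets", {}).values()
    }
    return sorted(referenced - defined)


print(unresolved_targets(template))
print(json.dumps(template, indent=2))
```

A quick local check like this can catch a mistyped target name before the template is ever submitted to AWS.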
An experiment is a snapshot of the experiment template at the time it was launched, with a few additions. An experiment includes:
- Snapshot of the experiment
- Creation and start time
- Execution ID
- Experiment template ID
- IAM role ARN
To see AWS FIS in action, along with a detailed explanation of Chaos Engineering and its process, you can watch our recorded webinar. In it, our featured speaker and expert, Suratip Banerjee, thoroughly covers Chaos Engineering, its process and challenges, relevant industry scenarios, and the AWS FIS procedure.