César D. Velandia

Chaos Engineering


Introduction

Main objectives:

  • Part of an overall resilience approach
  • Surface evidence of weaknesses before crises and outages
  • Make everyone responsible for code in production
  • Improve availability, stability, robustness, and service during outages
  • Assess how well code is running in your organization

The mindset, process, practices, and tools cover: exploring chaos engineering experiments based on hypotheses, turning experiments into game days, turning experiments into automated learning loops, and fleshing out collaborative and operational concerns.

The Chaos Toolkit is a free and open source tool with a CLI and libraries that enables writing automated chaos experiments and orchestrating them against your systems.
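As a minimal sketch of what such an experiment can look like (the service URL, pod label, and tolerance below are hypothetical placeholders, not from these notes), the following Python script writes a Chaos Toolkit experiment file with a steady-state hypothesis, a method, and rollbacks; it can then be executed with chaos run experiment.json.

    # Sketch: build a minimal Chaos Toolkit experiment and write it to disk.
    # Execute it afterwards with: chaos run experiment.json
    # The URL, label, and kubectl arguments are hypothetical examples.
    import json

    experiment = {
        "title": "Checkout stays available when one of its pods is killed",
        "description": "Steady state is probed over HTTP before and after the action.",
        "steady-state-hypothesis": {
            "title": "Checkout responds with HTTP 200",
            "probes": [
                {
                    "type": "probe",
                    "name": "checkout-must-respond",
                    "tolerance": 200,  # expected HTTP status code
                    "provider": {
                        "type": "http",
                        "url": "http://checkout.example.com/health",
                        "timeout": 3,
                    },
                }
            ],
        },
        "method": [
            {
                "type": "action",
                "name": "kill-one-checkout-pod",
                "provider": {
                    "type": "process",
                    "path": "kubectl",
                    "arguments": ["delete", "pod", "-l", "app=checkout", "--wait=false"],
                },
            }
        ],
        "rollbacks": [],
    }

    with open("experiment.json", "w") as f:
        json.dump(experiment, f, indent=2)

The steady-state probes run before and after the method, so a deviation after the action is exactly the evidence of weakness the experiment is looking for.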

Importance

Weaknesses hide in infrastructure, platforms, and applications, and even in people, policies, practices, and playbooks; they tend to reveal themselves in the middle of a full-scale outage.

Chaos engineering is experimenting on the system in order to build confidence in its capability to withstand turbulent production conditions.

Users want reliability, so focus on establishing evidence-based facts about a system's resilience.

Chaos engineering: provide evidence of system weaknesses –dark debt– via experiments that give insight into your system's capability to respond.

Dark debt

Dark debt appears in complex systems and causes failures that are not immediately recognizable. It is caused by interacting physical and logical parts, has no specific countermeasure until it is revealed, and is a by-product of complex modern software.

Dependencies are common hot spots for dark debt, in the form of sluggish callbacks, unreachable services, intermittent responses, and high throughput and its implications for end users. Even where there is already a strong suspicion of a weakness, chaos engineering helps by providing evidence to back it up. It also strengthens feedback loops and reveals the actual behavior of planned resiliency features once they are exercised, e.g., when a circuit breaker opens, what are the implications?

Stakeholders and scope include DevOps, applications, services, infrastructure, platforms, monitoring, pipelines, documentation, ticket systems, traffic, users, and third parties. Chaos engineering implies mitigation and risk management: overcome weaknesses before they cause an outage.

Chaos engineering embraces failure in production and proactively creates experiments to build confidence in the production system, including its people, processes, and practices.

When looking for dark debt, look into infrastructure (hardware, VMs, cloud providers, IaaS, network), platforms (Kubernetes and other abstractions), applications (the code base), and people, practices, and processes (the rest of the sociotechnical system).

Process


                     +----------------------------+
                     |                            |
                     | What do we know X will do? |
                     |                            |
                     +----------------------------+
                                  |
                +--------------Hypotheses-----------+
                |                                   |
     +----------v----------+            +-----------v-----------+
     |                     |            |                       |
     |      Game days      |            | Automated Experiments |
     |                     |            |                       |
     +---------------------+            +-----------------------+
               |                                         |
               |                                         |
               |      +----------------------------+     |
               |      |                            |     |
               +----->+          Evidence          +<----+
                      |                            |
                      +----------------------------+

Practices

Game days are manual experiments run by teams, where everyone can see how the system handles failure in a production-like environment. They are usually low cost but do not scale.

Because production changes continuously (through continuous deployment) and that change creates turbulence, automated chaos experiments become highly desirable. They can be set up carefully using available tools with minimal intervention. Make it a goal to automate experiments with the tools at hand so that no manual intervention is required and no disruption is caused to other work, for example by scheduling runs as sketched below.
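One rough sketch of that automation (the interval, experiment file, and log message are assumptions, not from these notes) is to re-run the experiment on a fixed schedule; in practice a cron job or CI pipeline is a more natural home than a long-lived loop.

    # Sketch: re-run a Chaos Toolkit experiment on a fixed interval without
    # manual intervention. A cron job or CI pipeline would normally own this.
    import subprocess
    import time

    INTERVAL_SECONDS = 6 * 60 * 60  # hypothetical cadence: every six hours

    while True:
        # "chaos run" exits non-zero when the steady-state hypothesis deviates,
        # which is the evidence of a weakness we want to surface and review.
        result = subprocess.run(["chaos", "run", "experiment.json"])
        if result.returncode != 0:
            print("Steady-state hypothesis deviated; review the run journal")
        time.sleep(INTERVAL_SECONDS)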

For production chaos experiments, a limited –blast radius– is often used to avoid production disruptions. Take controlled risks to instill confidence by trying to find weaknesses in your system:

  1. Start with small blast radius experiments in a sandbox environment (e.g., staging)
  2. Progressively expand the blast radius until no evidence of weaknesses is found
  3. Dial back the blast radius, move to production, and rinse and repeat (a blast-radius selection sketch follows this list)
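A small sketch of the dial-up/dial-back idea (instance names and percentages are made up): cap the fraction of candidate targets an experiment run may touch, and only grow that cap as earlier runs come back clean.

    # Sketch: bound the blast radius by selecting at most a capped fraction
    # of candidate targets for each experiment run.
    import math
    import random

    def select_targets(instances, blast_radius_pct):
        """Return a random subset covering at most blast_radius_pct of instances."""
        count = max(1, math.floor(len(instances) * blast_radius_pct / 100))
        return random.sample(instances, count)

    # Hypothetical progression: start tiny in staging, widen while no weakness
    # is found, then dial back down before moving the experiment to production.
    staging_instances = [f"checkout-{i}" for i in range(10)]
    for pct in (10, 25, 50):
        print(f"{pct}% blast radius ->", select_targets(staging_instances, pct))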

Finding weaknesses in production encourages improvement more than finding them in a staging environment!

One of the main advantages of running chaos experiments is that they drive improved observability.

A running system's ability to be debugged. The ability to comprehend, interrogate, probe, and ask questions of a system while it is running is at the heart of this debuggability. – Charity Majors

The two go hand in hand: in chaos engineering you want to detect evidence of a system's reactions to out-of-the-ordinary conditions, and you rely on the system's improved observability to do so.

Experiments, game days, and automated chaos experiments are shared responsibilities of the system's owners; dedicated engineers can help coordinate them so that everyone learns from the findings.

Hypotheses building

Core objective: build confidence and trust under turbulent conditions by exposing the system, in a controlled way, to network failures, outages, and latency, and observing the results. One key principle is to prepare well so you can learn adequately from the experiments.

Experiments

What do you want to learn? Which experiments might be valuable? It is not about finding a way to create the most chaos.

  • What do we want to gather more evidence (support or refute) for?
  • What could go wrong?

Create a hypothesis backlog from these questions. Also use past incidents and "what if" questions, or ask around about major concerns using a sketch of the system.

  1. Incident analysis: Identify the main contributing factors and supplement the conclusions with chaos engineering findings. It is an after-the-fact way of learning, so it should not be the only method of improvement; look at more proactive steps as well.
  2. Sketch of the system: Paint a common picture (or several) of your system, including people, practices, and processes. Make it detailed enough that specific questions show up and you start developing sensitivity to failures.
People, practices, and processes
- Maintainers, CI/CD, stages used, monitoring tools, stakeholders

Application
- Timeouts, persistence and integration

Platform
- Dependencies of and on the platform; orchestration

Infrastructure
- Machines used: virtual or physical, and their setups

Capturing mental models should be done over time and is part of the continuous improvement of the system. Some crucial questions to ask:

  • Could this fail?
  • Has this caused problems before?
  • If this fails, what else could be impacted?

Build a large collection of possible failures, e.g., the DB cluster is unavailable, and add as much detail as possible to develop sensitivity and build up a backlog.

Failure Mode and Effects Analysis Lite

FMEA is part of reliability engineering; a simplified (lite) version can be used to assess which possible failures will become hypotheses. Process overview:

1- Make a detailed sketch of the system

2- Collect possible system failures from different perspectives

3- Reach consensus on the relative likelihood x impact of such failures (which is to some extent guesswork; the experiments will help find more evidence)

     ^
     |
     +---------------------------------------------+
     |              |            X  |              |
     |              |   X           |  X       X   |
   I |              |        X      |        X  X  |
   m |              |               |              |
   p +---------------------------------------------+
   a |              |             X |              |
   c |              |               |   X          |
   t |              |     X         |              |
     |              |  X            |              |
     +---------------------------------------------+
     |       X   X  |               |              |
     |              |             X |              |
     |   X   X      |               |              |
     |  X           |               |              |
     +------------------------------------------------->
                        Likelihood
Impact likelihood grid with failures mapped out

Each group has its own tendency toward optimism or pessimism; if you are unhappy with the spread, relocate the cards by setting new thresholds.

4- Estimate the value of building confidence in how the system reacts to each failure, based on the *-ilities you care about (reliability, safety, security, durability, availability)

5- Create a hypothesis backlog from the high-impact, high-likelihood failures that touch the "-ilities" you care about; go for the low-hanging fruit, i.e., the failures with the greatest value, using the team's chosen criteria (a scoring sketch follows the hypothesis card below)

6- Convert failures into hypotheses by negating their impact, e.g.:


        +-----------------------------------------+
        |                                         |
        |      the system will meet its SLO       |
        |                                         |
        |                                         |
        |  if DB cluster X becomes unavailable    |
        |                                         |
        |                                         |
        |           ---contribution---            |
        |         availability, durability        |
        |                                         |
        +-----------------------------------------+
Hypothesis card
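A sketch of steps 3-6 under stated assumptions (a 1-5 scale for likelihood and impact, a simple weighting by the number of "-ilities" a failure threatens, and made-up failures): order the backlog by score and convert the top entries into hypothesis cards by negating their impact.

    # Sketch: turn scored failures into a prioritized hypothesis backlog.
    # Scales, weights, and failure descriptions are illustrative only.
    from dataclasses import dataclass

    @dataclass
    class Failure:
        description: str
        likelihood: int   # 1 (rare) .. 5 (frequent)
        impact: int       # 1 (minor) .. 5 (severe)
        qualities: tuple  # the "-ilities" this failure threatens

        def score(self):
            # Relative value: likelihood x impact, boosted per quality touched.
            return self.likelihood * self.impact * (1 + len(self.qualities))

        def to_hypothesis(self):
            # Negate the impact: assert the steady state holds despite the failure.
            return (f"The system will meet its SLO if {self.description}. "
                    f"Contribution: {', '.join(self.qualities)}")

    backlog = sorted(
        [
            Failure("DB cluster X becomes unavailable", 2, 5,
                    ("availability", "durability")),
            Failure("checkout latency doubles under peak traffic", 4, 3,
                    ("reliability",)),
            Failure("a third-party payment API times out", 3, 4,
                    ("availability",)),
        ],
        key=Failure.score,
        reverse=True,
    )

    for failure in backlog:
        print(failure.score(), "-", failure.to_hypothesis())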

Resources

  • Book: Learning Chaos Engineering (Russ Miles, O'Reilly)