Why do developers (sometimes) wreak chaos on their own systems?
In an ideal world, high-load systems used by thousands of people across time zones should be able to survive – and quickly recover from – sudden traffic spikes, server outages, interruptions to Internet connectivity, and other such problems. When engineers work proactively to mitigate the consequences of such problems, DevOps folks talk about continuous resilience.
In the same way that we have moved from a few big software releases a year to continuous delivery of many small changes, we need to move from annual disaster recovery tests (or simply suffering when things actually break) to continuously tested resilience.
But how do you arrive at continuous resilience? To get there, you need to know how your system behaves when parts of it break down.
That’s where chaos engineering steps in. Chaos engineering is the practice of deliberately injecting failures into your system to see how it behaves under extreme conditions.
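To make the definition concrete, here is a minimal Python sketch of the idea: terminate a random worker process of a locally running service, then check whether the service still answers. The process name and the http://localhost:8080/health endpoint are purely illustrative assumptions, not part of any real tool.

```python
import random
import subprocess
import urllib.request

def find_worker_pids(name: str) -> list[int]:
    """Return PIDs of processes whose command line matches `name` (uses pgrep)."""
    out = subprocess.run(["pgrep", "-f", name], capture_output=True, text=True)
    return [int(pid) for pid in out.stdout.split()]

def kill_random_worker(name: str) -> int:
    """Pick one worker at random and terminate it (the deliberate failure)."""
    victim = random.choice(find_worker_pids(name))
    subprocess.run(["kill", "-TERM", str(victim)], check=True)
    return victim

def service_is_healthy(url: str = "http://localhost:8080/health") -> bool:
    """Check whether the service still responds after the failure."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False

if __name__ == "__main__":
    victim = kill_random_worker("my-service-worker")  # hypothetical process name
    print(f"Killed worker {victim}; still healthy: {service_is_healthy()}")
```

The point is not the mechanics of killing a process, but the pairing: every injected failure comes with an explicit check of how the system is expected to behave afterwards.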
How chaos engineering came to be
The history of modern chaos engineering goes back to Netflix. As the company was migrating to the cloud in 2011, its principal engineers asked developers to treat resilience not as an afterthought, but as a built-in property of the system. With a user base in the millions, Netflix realized that connectivity and availability issues were bound to happen. Rather than waiting for such incidents to occur, they decided to be proactive about ensuring that they would not cause major interruptions to the service.
So Netflix created Chaos Monkey – a tool that would randomly kill off production instances, forcing developers to beef up redundancy to ensure an acceptable level of service. (Netflix open-sourced Chaos Monkey in 2012 and the code is now licensed under the Apache 2.0 software license.)
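Conceptually, the core of such a tool is small. The following Python sketch shows the idea of picking a random instance and terminating it; it is not Netflix's actual implementation. It assumes boto3 with configured AWS credentials and an illustrative opt-in tag called chaos-opt-in, and by default it only performs a dry run.

```python
import random

import boto3  # assumes AWS credentials are configured in the environment
from botocore.exceptions import ClientError

ec2 = boto3.client("ec2")

def candidate_instances(tag: str = "chaos-opt-in") -> list[str]:
    """List running instances that have explicitly opted in to chaos."""
    resp = ec2.describe_instances(
        Filters=[
            {"Name": f"tag:{tag}", "Values": ["true"]},
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    return [
        inst["InstanceId"]
        for reservation in resp["Reservations"]
        for inst in reservation["Instances"]
    ]

def kill_one_at_random(dry_run: bool = True) -> str | None:
    """Terminate one opted-in instance at random: the Chaos Monkey idea in a single call."""
    instances = candidate_instances()
    if not instances:
        return None
    victim = random.choice(instances)
    try:
        ec2.terminate_instances(InstanceIds=[victim], DryRun=dry_run)
    except ClientError as err:
        # With DryRun=True, AWS signals that the call would have succeeded
        # by raising DryRunOperation instead of terminating anything.
        if err.response["Error"]["Code"] != "DryRunOperation":
            raise
    return victim
```

The opt-in tag matters: randomness is only useful when teams have agreed up front that their instances may be killed and have built the redundancy to absorb it.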
The great value of this approach became obvious in 2017 – when a major AWS outage occurred, ironically right in the middle of the AWSome Day conference in Edinburgh, Scotland. Netflix was one of the few global app providers that day that managed to restore service within mere hours rather than days.
What is chaos engineering?
While it can be many things, chaos engineering usually involves testing for three broad classes of failure (a sketch of a network-failure injection follows the list):
- Infrastructure failures
- Network failures
- Application failures
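As an illustration of the second category, here is a hedged Python sketch that temporarily degrades the network using Linux tc/netem. It assumes a Linux host, root privileges, and an interface named eth0, all of which are assumptions for the example rather than recommendations.

```python
import subprocess
from contextlib import contextmanager

@contextmanager
def injected_latency(interface: str = "eth0", delay: str = "200ms", jitter: str = "50ms"):
    """Temporarily add latency and jitter to an interface via tc/netem (requires root)."""
    add = ["tc", "qdisc", "add", "dev", interface, "root", "netem", "delay", delay, jitter]
    remove = ["tc", "qdisc", "del", "dev", interface, "root", "netem"]
    subprocess.run(add, check=True)
    try:
        yield
    finally:
        # Always clean up so the injected failure does not outlive the experiment.
        subprocess.run(remove, check=True)

# Usage (hypothetical): exercise timeouts and retries while the network is slow.
# with injected_latency("eth0", delay="300ms", jitter="100ms"):
#     run_timeout_and_retry_checks()
```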
One doesn’t necessarily need to use Netflix’s Chaos Monkey for that, though. A case study from Cerner shows that you may not want to use the de facto chaos-wreaking tool, because it could then become a scapegoat for all sorts of mysterious problems. Here is what Cerner’s Carl Chesser said about this:
Spinnaker has a capability we can use, Chaos Monkey, and there are other tools out there to even do chaos things in your Kubernetes cluster. What we found was every time we would try to consider introducing something else that was causing chaos without that focus of an experiment, it was becoming like a scapegoat for mysterious problems. If something was going wrong, it was, “Yes, it’s probably that Chaos Monkey,” without really finding out that it was going to be that case.
What Cerner did instead was always put forth a clear hypothesis of how the system should behave if they did something specific – for example, if they took out a part of the infrastructure, say, by killing a full hypervisor in a particular availability zone. Here is what Cerner learned from those experiments (a sketch of what such a hypothesis-driven experiment might look like in code follows the list):
- The team should always prepare for an experiment: decide who does what, what the expected effect/behavior is, how it will be measured and with which metrics, and set up the necessary capture mechanisms.
- You should ensure sufficient observability and find blind spots, if there are any.
- The team should run the experiment in a dedicated space where everyone can sit together.
- You should capture all the surprises (expected vs. actual behavior) and share them in an open, searchable repository.
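To show how that checklist might translate into code, here is a hedged Python sketch of a hypothesis-driven experiment. It illustrates the structure rather than Cerner's actual tooling; the field names, the tolerance threshold, and the chaos-results.jsonl report file are all assumptions.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable

@dataclass
class ChaosExperiment:
    """One hypothesis-driven experiment, loosely following the checklist above."""
    hypothesis: str                  # e.g. "Killing one hypervisor in an AZ does not raise p99 latency"
    owner: str                       # who runs the experiment and who observes
    measure: Callable[[], float]     # steady-state metric, e.g. error rate or p99 latency
    inject: Callable[[], None]       # the failure, e.g. kill a hypervisor
    rollback: Callable[[], None]     # restore the system afterwards
    tolerance: float = 0.05          # allowed metric drift before we record a surprise
    notes: list[str] = field(default_factory=list)

    def run(self, report_path: str = "chaos-results.jsonl") -> bool:
        before = self.measure()
        self.inject()
        try:
            after = self.measure()
        finally:
            self.rollback()
        within_tolerance = abs(after - before) <= self.tolerance
        # Capture expected vs. actual behavior in an open, searchable record.
        record = {
            "time": datetime.now(timezone.utc).isoformat(),
            "hypothesis": self.hypothesis,
            "owner": self.owner,
            "metric_before": before,
            "metric_after": after,
            "within_tolerance": within_tolerance,
            "notes": self.notes,
        }
        with open(report_path, "a") as fh:
            fh.write(json.dumps(record) + "\n")
        return within_tolerance
```

Whether the metric stays within tolerance or not, the appended record is the artifact that goes into the shared, searchable repository of surprises.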
What Cerner found was that chaos engineering not only gave them a much better understanding of their system (which made their later move to Kubernetes pretty much a breeze), but also gave the team confidence and clarity about who should turn to whom during a crisis.
In conclusion
Chaos engineering is not an entirely new concept, but it takes a mindset shift within an organization to plan proactively for when the worst comes to the worst. Do you practice continuous resilience in your org? Is your experience different from that of Netflix or Cerner? Do let us know in the comments!