Getting closer to meeting (almost) all your resilience requirements…
…or what you’ve been missing about Chaos Engineering
We’ve all watched microservice-based distributed environments gain adoption among software vendors worldwide. The systems we build today fail in far more ways than we are used to seeing on resource-limited, controlled testing stages. Usual testing scenarios do not generate new knowledge; they simply result in a true or false statement, so traditional testing addresses only a small portion of quality problems. Yet we all depend on these systems more than ever, and their failures have become much harder to predict. This is a growing challenge for software vendors, who cannot afford to sit and wait for the next costly system disruption to happen. For that reason, more and more companies are adopting the Chaos Engineering discipline. But what does this discipline promote, how is the practice defined, and where is it going?
Let’s start with some Chaos…
Chaos theory is a branch of mathematics stating that, within the apparent randomness of chaotic complex systems, there are underlying patterns: interconnectedness, constant feedback loops, repetition, self-similarity, fractals, and self-organization. By analogy, the Chaos Engineering discipline in the software industry is an approach for identifying how complex systems react to seemingly random events. Just as a scientist conducts research to advance their knowledge in an area of interest, the principles of Chaos Engineering promote running experiments to identify the strengths and weaknesses of a system. By learning about weaknesses early, companies can address them before they become customer-harming outages. Proactiveness, in this case, wins over most of today’s reactive incident response models.
What differentiates Chaos practices from traditional QA testing…?
Most traditional testing methodologies produce concrete, binary results under specific, predefined conditions. It is an assertion-based approach, where little to no new knowledge about the system is generated. Relying purely on traditional testing scenarios therefore cannot guarantee that a large microservice-based application, or any enterprise system, will respond properly under pressure. Chaos Engineering, by contrast, promotes experimentation, which generates valuable insight into how reliable and robust a system really is. We can direct experiments at parts of a system that we know have caused issues before, and the controlled nature of Chaos practices lets engineers limit the impact area (the blast radius) so the experiment cannot harm other parts of the system.
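To make the contrast with assertion-based testing concrete, here is a minimal sketch of the shape a chaos experiment typically takes: verify a steady-state hypothesis, inject a fault within a limited blast radius, verify again, and always roll back. All the names (`check_steady_state`, `run_experiment`, the toy `measure` function) are illustrative assumptions, not a real framework.

```python
import random

def check_steady_state(latencies_ms, threshold_ms=200):
    """Steady-state hypothesis: 95th-percentile latency stays under the threshold."""
    ordered = sorted(latencies_ms)
    p95 = ordered[int(0.95 * (len(ordered) - 1))]
    return p95 < threshold_ms

def run_experiment(measure, inject_fault, rollback):
    """Classic chaos-experiment loop: hypothesis -> fault -> hypothesis -> rollback."""
    if not check_steady_state(measure()):
        return "aborted: system unhealthy before the experiment"
    inject_fault()
    try:
        healthy = check_steady_state(measure())
    finally:
        rollback()  # limit the blast radius: the fault never outlives the run
    return "hypothesis held" if healthy else "weakness found"

# Toy stand-in simulating a service whose latency worsens under the fault.
state = {"faulty": False}
def measure():
    base = 250 if state["faulty"] else 90
    return [random.uniform(base * 0.5, base) for _ in range(100)]

print(run_experiment(measure,
                     lambda: state.update(faulty=True),
                     lambda: state.update(faulty=False)))
```

Note that the rollback runs in a `finally` block: even if the verification step crashes, the injected fault is cleaned up, which is exactly the "controlled" property that separates an experiment from an outage.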
Let’s talk about possible Chaos experiments…
- Test how a service degrades when resources are exhausted (high CPU load, memory allocation, read/write pressure, etc.)
- State-based experiments (shutting down hosts, time travel, etc.)
- Network faults (injecting latency, blocking access to DNS servers, etc.)
- Application-based chaos (injecting runtime exceptions)
There are countless scenarios for running chaos experiments, and the right ones vary with the architecture. These experiments help teams build the knowledge and expertise needed to resolve outages.
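As one concrete instance of the "network faults" category above, a team can wrap a remote call with artificial latency and check that the caller degrades gracefully instead of hanging. This is a hedged, self-contained sketch: `fetch_profile`, the delay values, and the cached-default fallback are all hypothetical stand-ins for a real service call.

```python
import time

def with_latency(func, delay_s):
    """Return a wrapper that injects a fixed delay before calling func."""
    def wrapper(*args, **kwargs):
        time.sleep(delay_s)  # simulated network latency
        return func(*args, **kwargs)
    return wrapper

def call_with_timeout(func, timeout_s):
    """Naive budget check: measure wall-clock time and fall back to a
    cached default if the call exceeded the latency budget."""
    start = time.monotonic()
    result = func()
    if time.monotonic() - start > timeout_s:
        return {"name": "cached-default"}  # graceful degradation path
    return result

def fetch_profile():
    return {"name": "alice"}

# Healthy path: no injected latency, the real result comes back.
print(call_with_timeout(fetch_profile, timeout_s=0.05))
# Chaos path: 100 ms of injected latency trips the 50 ms budget,
# so the fallback result is returned instead.
print(call_with_timeout(with_latency(fetch_profile, 0.1), timeout_s=0.05))
```

If the chaos path returned a hung request or an unhandled exception rather than the fallback, the experiment would have surfaced exactly the kind of weakness traditional assertions rarely catch.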
Nowadays, even a small outage in a single AWS availability zone can cascade into the disruption of millions of online services worldwide. That is where Gremlin comes in. It is a Chaos Engineering service founded by CEO Kolton Andrus, who previously architected Netflix’s well-known failure injection service. Gremlin is an end-to-end solution that offers a rich set of tooling for disrupting entire systems in a controlled way. Accompanied by well-designed and powerful APIs, that tooling enables engineers to build up complex, controlled experimentation scenarios. In effect, Gremlin encapsulates the complex engineering behind what was described above as a system-disrupting environment. Using its experimentation packs, software vendors can thoughtfully inject failure into hosts or containers regardless of where they run, whether that is the public cloud or their own data center. Attacks can be configured, started, and stopped via the web app, API, or CLI.
At Resolute Software, we always aim to build highly available architectures, with resiliency as one of our top priorities. Stay tuned for the upcoming article on getting hands-on with Gremlin! In the meantime, feel free to jump in and read the Gremlin docs!
Feel free to get in touch with us.
Originally published at https://www.resolutesoftware.com on April 21, 2020.