
Why Now is the Time for Chaos (Engineering)


Blog Article Published: 11/30/2021

Written by Morey J. Haber, BeyondTrust.

If you are old enough to remember the television show Get Smart, you may be familiar with the concept of applying chaos to provide order. While this spy show spoof mocked political stereotypes of the time regarding control versus chaos, the slapstick humor was designed to highlight the differences between a world with, or without, order. In some cases, chaos definitely seemed like the better alternative.

So, what does Get Smart have to do with chaos engineering? Sometimes chaos is the key ingredient that puts the world in perspective, helping us make key discoveries about how complex systems operate, and where they are vulnerable.

Chaos engineering is the concept of experimenting on a resource with the goal of building confidence in the resource’s capability to tolerate unpredictable circumstances during operations. It’s sort of like a more sophisticated version of having a monkey throw a wrench into a complex machine and seeing what happens. In fact, Netflix, which popularized the concept of chaos engineering, monikered their chaos-making tool “Chaos Monkey”.
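The core idea behind a tool like Chaos Monkey can be sketched in a few lines. The snippet below is a hypothetical illustration, not Netflix's implementation: it simply picks a random "victim" from a fleet, where a real tool would then call the cloud provider's API to terminate that instance.

```python
import random

def chaos_monkey(instances, seed=None):
    """Pick one instance at random to 'terminate', mimicking the
    core idea of injecting failure unpredictably into a fleet."""
    rng = random.Random(seed)  # seedable for repeatable experiments
    victim = rng.choice(instances)
    # A real tool would call the cloud provider's terminate API here;
    # this sketch just returns the chosen victim for illustration.
    return victim

fleet = ["web-1", "web-2", "web-3", "db-1"]  # hypothetical instance names
victim = chaos_monkey(fleet, seed=42)
```

The point of the randomness is that no team can prepare for one specific failure; the system as a whole must tolerate any instance disappearing at any time.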

Applying Chaos to Uncover Cloud Risks

The cloud, digital transformation, and the massive use of, and dependency on, software have truly changed our lives. Businesses have developed millions of lines of code in a short period of time (10 years or so), which raises the question of resiliency and security for the solutions we have put into production.

Today, the number of applications deployed in the cloud is mind-boggling. These applications were developed by many thousands of different vendors, the open-source community, and in diverse locations. How can we ensure these applications and systems will operate correctly when security, scalability, and other environmental issues become unpredictable? After all—who knows what the next attack vector or cause of an outage will be?

Our world grows more digital by the day. From industrial IoT to wearables, every scrap of our reality is being digitized. As the Internet is built into almost anything and everything in our lives, real-world events (i.e. pandemics, natural disasters, etc.) and cybersecurity threats can have a significant impact on these production environments in ways humans cannot always easily anticipate. Even advanced simulation software is not great at predicting potential second- and third-order effects of unanticipated events. The resolutions to restore stability may not always be well understood. After all, you cannot just reboot the cloud like your personal computer. The results for the cloud / multicloud environment could be distributed in nature, affecting all the different components in unpredicted ways. Hence, chaos and not control.

To understand the problem across the entire environment, the risk surface itself must be understood and the potential outcomes established, from data corruption through denial of service. This encompasses everything from attack vectors to collateral outages that could cause a localized issue or a cascading failure, including the inability to access resources needed to ultimately resolve the issue.

Once the potential outcomes are uncovered via testing and understood, you can embark on the process of remediation. Address your weakest points first, especially single points of failure. Other remediations might entail complex issues that have a high likelihood of occurring and that could impact service level agreements, customer satisfaction, or other stated missions. The goal is ultimately to improve the stability of, and confidence in, cloud solutions.

Chaos engineering allows cloud resources to be tested in a controlled manner, in production, under realistic conditions, including attacks, outages, and other forms of corruption (chaos). The outcome demonstrates what could really happen when chaos is introduced. This is in sharp contrast to the controlled testing generally performed by quality assurance.

5 Steps to Effectively Unleashing Chaos to Improve System Security & Stability

Implemented with care and diligence, chaos engineering can be a powerful tool for facilitating experiments that uncover systemic weaknesses in a controlled production environment. Chaos engineering provides a unique and unparalleled potential to tackle the insecurity of distributed systems in the cloud at scale.

These experiments typically can be broken into five steps:

  1. Start by defining ‘normal operations’ as some measurable output of the environment that indicates proper and expected behavior. This will be your control group, as opposed to your chaos engineering test group.
  2. Hypothesize that these ‘normal operations’ will continue in both the control group and the chaos experimental group. These are your best educated guesses as to what will happen.
  3. Design the experiment to include individual tests, combinations of tests, and a mix of manual and automated steps. This will help you develop resolution plans when the events occur in the real world.
  4. Introduce attacks, changes, outages, hardware failures, virtual machine (VM) and instance failures, etc. that reflect real-world events to measure in the cloud. Collect the results.
  5. Document the performance of the system and expected availability when comparing the control group versus the chaos engineering test group. This will help you engineer remediation and apply the solution to avoid any future, undesirable results.
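The five steps above can be sketched as a small experiment harness. This is a minimal illustration under assumed conditions: the steady-state metric (mean request success rate), the fault injector, and the tolerance threshold are all hypothetical stand-ins for whatever your environment actually measures.

```python
import random
import statistics

def steady_state(samples):
    """Step 1: define 'normal operations' as a measurable output;
    here, the mean success rate of sampled requests (hypothetical metric)."""
    return statistics.mean(samples)

def run_experiment(control_samples, inject_fault, tolerance=0.05):
    """Steps 2-5: hypothesize that steady state holds, introduce a
    real-world-style fault, collect results, and compare the control
    group against the chaos group."""
    baseline = steady_state(control_samples)                    # control group
    chaos_samples = [inject_fault(s) for s in control_samples]  # chaos group
    observed = steady_state(chaos_samples)
    # Step 5: document whether the hypothesis held within tolerance.
    return {
        "baseline": baseline,
        "observed": observed,
        "hypothesis_held": (baseline - observed) < tolerance,
    }

# Simulated real-world event: each request has a 10% chance of failing outright.
rng = random.Random(7)
fault = lambda rate: rate if rng.random() > 0.1 else 0.0
report = run_experiment([0.99] * 100, fault)
```

In a real system, the control samples would come from live telemetry and the fault would be an actual injected outage or attack; the comparison and documentation steps, however, look much the same.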

The primary goal is to make it very difficult to deviate from the expected steady state of normal operations. You want to ensure predictable behavior from the system when chaotic events are introduced, whether individually or in combination with each other.

5 Best Practices for Chaos Engineering Implementations

The following five recommendations provide best practices for chaos engineering implementations. Following them should increase confidence in your testing and in your remediation plans for resolving any identified issues.

  1. Model normal operational behavior: Focus on the measurable output of the system, such as streaming video. Also make sure to include internal system metrics, such as CPU consumption, error rates, network latency, etc. All these metrics could model normal steady state behavior. By focusing on systemic changes during experiments, chaos engineering verifies that the system is working as expected, as opposed to trying to validate how it works.
  2. Model real issues: To provide value, chaos engineering requires the input of attacks, outages, and other issues that reflect real-world events—even if such events seem to have a low potential likelihood. It is imperative to prioritize testing criteria by risk and/or other priorities relevant to the business. Even though you are testing in the cloud, consider events that correspond to an instance failure, VM outage, software failure, network outage, denial of service, malformed traffic, vulnerabilities, etc. that could occur. Any test capable of disrupting the normal operations is a potential candidate for a chaos engineering experiment.
  3. Experiments in production: Performing these experiments in production based on defined tests is critical to derive value from chaos engineering. Cloud resources will absolutely behave differently in development, test, quality assurance, and production environments. Since real-world operations can truly change at any time, the production environment is critical to the success of this method.
  4. Automation: Running experiments manually is labor-intensive and, eventually, unsustainable for a chaos engineering project. The best way to tackle chaos engineering is to automate experiments and bind combinations of tests together to measure the impact when unlikely combinations are applied. The results will help prove what can be remediated as well as indicate what events or combinations of events may lead to unmitigable catastrophe.
  5. Controlling chaos: Experimenting in production has the potential to cause myriad problems, including outages, performance degradation, data corruption, and more. Such impacts will ultimately anger clients and users. Thus, consider limiting chaos engineering experiments to certain regions, clients, tenants, applications, instances, etc., and conduct them only during well-thought-out periods of time. A well-designed failover should also be in place.
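The "controlling chaos" practice is often called limiting the blast radius. A minimal sketch, assuming hypothetical region names and instance records: restrict the experiment to one region and to a small fraction of the in-scope fleet, leaving the remainder untouched as failover capacity.

```python
def select_blast_radius(targets, region="us-east-1", max_fraction=0.1):
    """Best practice 5: limit the chaos experiment to one region and a
    small fraction of its instances; everything else stays healthy and
    available as failover. Region and fraction here are illustrative."""
    in_scope = [t for t in targets if t["region"] == region]
    limit = max(1, int(len(in_scope) * max_fraction))  # at least one target
    return in_scope[:limit]

# Hypothetical fleet split across two regions.
fleet = [{"id": f"i-{n}", "region": "us-east-1" if n % 2 else "eu-west-1"}
         for n in range(20)]
victims = select_blast_radius(fleet)
```

Real chaos platforms layer more safeguards on top of this, such as automatic abort conditions that halt the experiment the moment the steady-state metric degrades beyond a threshold.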

Getting Smart with Chaos

As systems become ever-more complex and interdependent, chaos engineering is a valuable tool to uncover vulnerabilities and potential points of failure that are otherwise exceedingly difficult, if not impossible, for humans to anticipate.

Chaos engineering can be implemented to further uncover risks of cloud threat vectors by leveraging many standardized techniques (e.g. penetration testing) against your environment. While other tests will address scalability and resiliency, chaos engineering helps address uncertainty in distributed systems when real-world events are applied in combination. The results will provide confidence in your cloud implementation and properly test the controls in place from your cloud service provider to help mitigate any long-term outages or other security concerns.


About the Author

Morey J. Haber is the Chief Security Officer at BeyondTrust. He has more than 25 years of IT industry experience and has authored three books: Privileged Attack Vectors, Asset Attack Vectors, and Identity Attack Vectors. He is a founding member of the industry group Transparency in Cyber, and in 2020 was elected to the Identity Defined Security Alliance (IDSA) Executive Advisory Board. Morey currently oversees BeyondTrust security and governance for corporate and cloud-based solutions and regularly consults for global periodicals and media. He originally joined BeyondTrust in 2012 as part of the eEye Digital Security acquisition, where he had served as a Product Owner and Solutions Engineer since 2004. Prior to eEye, he was Beta Development Manager for Computer Associates, Inc. He began his career as a Reliability and Maintainability Engineer for a government contractor building flight and training simulators. He earned a Bachelor of Science degree in Electrical Engineering from the State University of New York at Stony Brook.
