Security Chaos Engineering: Fewer Blind Spots and Improved Stress Testing Move CISOs Closer to Cyber Resilience

Published 02/01/2024

Written by Luke Luckett.

Headlines in the press over the past few quarters have shown that resilience in the financial sector can be stress tested – sometimes with little warning. According to the Federal Reserve, its stress tests assess whether individual “banks are sufficiently capitalized to absorb losses during stressful conditions while meeting obligations … and continuing to be able to lend to households and businesses.”

Thinking big picture, the Fed also conducts capital stress tests on the collective banking industry. These have become a critical tool since the 2007-09 financial crisis, aiding in “bolstering confidence” in the sector.

This got me thinking about CIOs and CISOs, to whom I ask: Beyond a binder full of security protocols and procedures, have you tested your response to actual cyber risks? How quickly are your teams and technology able to respond to an incident? Have you ever stress tested the security of your systems the way cyber attackers do?

While teams that sit under CIOs and CISOs test and secure the systems within their respective organizational structures, financially motivated malicious actors and those pursuing cyber espionage couldn’t care less about internal silos. They continually poke, looking for weaknesses to serve as exploit paths.

Security Chaos Engineering and the Call to Maintain Cyber Resilience

Boards and cybersecurity leaders are reconsidering the way they view their portfolios of data and technology. The reality is that an organization will be breached at some point; for some large enterprises, it happens daily. Forward-thinking CIOs and CISOs have started to define and measure resilience in their organizations, drawing on the National Academy of Sciences’ definition of resilience as the ability to “prepare and plan for, absorb, recover from, and more successfully adapt to adverse events.”

In their new book, “Security Chaos Engineering: Sustaining Resilience in Software and Systems,” authors Kelly Shortridge and Aaron Rinehart make a compelling argument for moving away from divisions within security, such as “offensive vs. defensive prevention,” and urge us instead to shift our primary focus to maintaining resilience.

According to the authors, if we “architect our systems to expect failure, proactively challenge our assumptions through experimentation, and incorporate what we learn as feedback into our strategy, we can more fully understand how our systems work and how to improve and best secure them. Instead of seeking to stop failure from ever occurring, the goal in resilience and chaos engineering is to handle failure gracefully.”

In security chaos engineering (SCE), experiments (the “chaos”) are introduced intentionally to ensure cybersecurity systems and processes work correctly, which in turn results in better resilience.
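
To make the idea concrete, here is a minimal Python sketch of one such experiment, run against a simulated environment rather than real infrastructure. It injects a misconfiguration, checks whether a detection control notices it, and rolls the change back. The `FakeFirewall` class, the `detect_exposed_ports` control, and the port numbers are all invented for illustration; they do not come from the book.

```python
# A toy stand-in for a firewall; a real experiment would target actual
# infrastructure through that platform's own API.
class FakeFirewall:
    def __init__(self, open_ports):
        self.open_ports = set(open_ports)


def detect_exposed_ports(firewall, allowed):
    """Hypothetical detection control: report ports that are open but not allowed."""
    return sorted(firewall.open_ports - set(allowed))


def run_experiment(firewall, allowed, injected_port):
    """Inject an unauthorized open port, observe detection, then roll back."""
    # Steady state: the control should report nothing before the injection.
    assert detect_exposed_ports(firewall, allowed) == []
    firewall.open_ports.add(injected_port)  # introduce the "chaos"
    try:
        detected = injected_port in detect_exposed_ports(firewall, allowed)
    finally:
        firewall.open_ports.discard(injected_port)  # always restore state
    return detected


fw = FakeFirewall(open_ports=[443])
hypothesis_held = run_experiment(fw, allowed=[443], injected_port=23)
print("detection control caught the change:", hypothesis_held)
```

If `hypothesis_held` came back `False`, the experiment would have surfaced a gap in detection before an attacker found it, which is the point of running it deliberately.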

Resilience Requires Internal Balancing of Risk Against Product and Sales Efficiency

Designing, building and managing more resilient systems requires a shift from siloed information security programs to “platform resilience engineering.” As Shortridge and Rinehart note, the platform resilience engineering model “treats security as a product with end users, as something created through a process that provides benefits to a market (with internal teams as our customers).”

According to the authors, leaders can build resilience into their security practices (and internal security products) by:

Designating a platform engineering team
Identifying real problems and iterating on solutions
Prioritizing usability to promote adoption
Ensuring stakeholders get the “why” in what makes a product more resilient
Avoiding building security controls around what’s convenient for security teams; instead, building and promoting products that are flexible to how humans actually behave

In Practice: Stress Test Your Security Before an Attacker Does

According to insurer Zurich, conducting cyber stress tests is important because “the idea behind a stress test is to determine the critical systems, people and locations needed to continue to serve customers and how best to protect and recover them.” Zurich is not alone in this line of thinking. Cardinal Health, the Ohio-based health care services company, makes a regular practice of stress testing and has become especially fond of security chaos engineering.

In a final example from the “Security Chaos Engineering” deep dive, chapter contributors Jamie Dicken and Rob Duhart, Jr. reached consensus with Cardinal’s executive team and their Security Architecture teams in thinking that “theoretical security architecture would not protect an organization sprinting to the cloud. What Cardinal Health needed was ‘Applied Security’ – a means to bring continuous verification and validation concepts” of security chaos engineering to the enterprise.

The first step for Jamie and Rob was to identify the mission, which was “to identify unknown technical security gaps and partner with the organization to remediate them before they are exploited.”

The second step was to define and execute their chosen method, Continuous Verification and Validation (CVV), which has four parts:

Establish benchmarks by which gaps are identified, called Applied Security Benchmarks (ASBMs), and socialize them
Apply continuous checks that systems adhere to the ASBMs. Before any code is written, open source and commercial products are evaluated and a “build vs. buy” decision is made
Create a dashboard to illustrate real-time compliance of systems, visually showing technical gaps while keeping findings strategic for leaders
If adherence to the ASBMs decreases, an issue is added to the risk register for internal remediation, allowing the Applied Security team to move on to implement CVV in other areas while remediations naturally occur in parallel
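
As a rough illustration of the four CVV parts above, the following Python sketch models a benchmark as a named check, computes adherence across a set of systems, and files a risk-register entry when adherence drops. Every name here (`Benchmark`, `adherence`, `review`, the `mfa_enabled` field, the sample systems) is hypothetical and chosen for illustration; none of it is taken from the book or from Cardinal Health's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


# Hypothetical model of an Applied Security Benchmark (ASBM): a name plus a
# predicate that says whether one system's configuration meets the benchmark.
@dataclass(frozen=True)
class Benchmark:
    name: str
    check: Callable[[Dict[str, object]], bool]


def adherence(benchmark: Benchmark, systems: List[Dict[str, object]]) -> float:
    """Return the fraction of systems (0.0 to 1.0) that pass the benchmark."""
    if not systems:
        return 1.0  # assumption: nothing to check counts as fully adherent
    passing = sum(1 for s in systems if benchmark.check(s))
    return passing / len(systems)


def review(benchmark, systems, previous, risk_register):
    """Run the continuous check; log a risk entry if adherence decreased."""
    current = adherence(benchmark, systems)
    if current < previous:
        risk_register.append(
            {"benchmark": benchmark.name, "previous": previous, "current": current}
        )
    return current


# Illustrative data: the benchmark and systems are invented for this example.
mfa = Benchmark("MFA enforced", lambda s: s.get("mfa_enabled") is True)
systems = [
    {"host": "app-1", "mfa_enabled": True},
    {"host": "app-2", "mfa_enabled": False},
]
risk_register: List[dict] = []
score = review(mfa, systems, previous=1.0, risk_register=risk_register)
print(f"{mfa.name}: {score:.0%} adherent, {len(risk_register)} risk entry filed")
```

The value returned by `review` is the kind of number a compliance dashboard could plot over time, while the register entries hand remediation to the owning team.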

The final step – which started early in their chaos experiments and has become continuous – was to create a culture that embraces security chaos engineering. “Rather than building walls and hoping they won’t be breached, SCE introduces disruption to maximize the security value within an environment,” note the authors. For them, success looks like an organization that operates with a truly agile and automated security mindset, with SCE scaled widely across teams and products.

Continuous Security Testing Is a Critical Piece of Resilience Stress Testing

To gain control over vulnerability management and to meet the resilience goal of being able to “fail gracefully,” leaders need to ensure security testing is part of their resilience stress testing. Continuous pentesting of APIs as well as web, host, cloud and mobile assets is a critical element of both resilience stress testing and vulnerability management, and it gives resilient organizations confidence in their ability to bounce back from an incident.
