The Road to Autonomous Cloud Security Remediation
Published 12/07/2023
Originally published by Dazz.
Written by Tomer Schwartz, Co-founder & CTO, Dazz.
Back in the data center days, a typical enterprise had one or two applications and one or two engineering teams to deploy them. When there was a vulnerability, an engineer could simply log into a server and fix it. The problem was that no one did that in practice, which is why we had so many breaches, many of which exploited known vulnerabilities.
There was a brief moment in history when “automated remediation” or “automated response” was all the rage, but organizations quickly learned that when you modify production infrastructure directly, without going through testing, the application may break. The risk calculus was not promising: either change production and risk breaking it, or leave it as is and risk a breach.
The majority of the market landed on not breaking the application, opting instead to take on the risk of a future breach. (Remember Windows XP images in ATMs? Banks kept running operating systems with known vulnerabilities because that was less risky than having their ATMs stop working due to a naive change.)
Automated Remediation - A Risk Management Problem
The rationale was simple back then: we just can’t trust anyone, machine or human, to change our production environment. Fast forward to today, and with GenAI it seems like all of this should change! Well, not quite. The risk is the same; only the mechanics of the process are different. GenAI can produce wonderful results, but the machine is still sometimes wrong due to simple mistakes, errors, flawed training data, hallucinations, and adversarial inputs.
Is this an AI problem? No, it’s a risk management problem. In the case of using AI for remediation, we need to show that our remediation program is robust enough that it is unlikely to break anything, and we need to prove that the reward outweighs the risk in this equation. To do this, we need to build trust: our engineering teams need to be confident that our remediation program has the right stopgaps in place.
First, we need to appreciate that modern cloud environments cannot tolerate downtime, reliability, or integrity issues. They are also set up in a particular way, typically through a DevOps pipeline. If we have a vulnerability in production, we have to find its root cause, fix it there, redeploy the application, and verify that the fix worked. This is a (semi) easy task for an LLM, as long as we have all the data.
Assuming we have solved the root cause analysis problem, it is technically possible to ask an LLM how to upgrade a container, for example, and, with enough nuance and iteration, to get not only a correct answer but the best answer for your particular circumstances. The more context the LLM is given, and the more refined the prompt, the more precise the answer will be.
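As a rough illustration, here is what asking a model for a container fix might look like, assuming the OpenAI Python SDK; the Dockerfile path, model choice, and finding details are hypothetical placeholders, and the output is only a proposal that still has to be validated downstream.

```python
# Minimal sketch: ask an LLM to propose a base-image upgrade for a
# vulnerable container. Assumes the OpenAI Python SDK (openai>=1.0);
# the Dockerfile path and the finding text are hypothetical.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

dockerfile = open("Dockerfile").read()
finding = "CVE-XXXX-XXXX in libssl; fixed in a newer base image"  # placeholder

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system",
         "content": "You are a remediation assistant. Propose the smallest "
                    "Dockerfile change that fixes the finding. "
                    "Return only the updated Dockerfile."},
        {"role": "user",
         "content": f"Finding: {finding}\n\nDockerfile:\n{dockerfile}"},
    ],
)

# This is a proposal, not a trusted fix: it still has to survive the
# build/test/rescan loop described below.
print(response.choices[0].message.content)
```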
How AI Is Changing Remediation
Enter DevOps pipelines. CI/CD models play into this example nicely: if we already have a fix at the root cause, we can use the existing platform to build, test, and deploy the code, then share the results of that process with the AI model.
Assuming we manage to build the application and get it running in a lower environment, we can rescan that environment and see whether the vulnerability is gone. If not, we can feed the results back to the model and request another proposed change, repeating until we get a clean build and a clean scan.
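A minimal sketch of that feedback loop might look like the following; the propose_fix, build_and_deploy, and rescan callables are hypothetical hooks into your LLM, CI/CD system, and scanner, not a real API.

```python
def remediate(finding, source, propose_fix, build_and_deploy, rescan,
              max_attempts=5):
    """Iterate LLM proposals through the existing pipeline until the
    scanner comes back clean. All three callables are hypothetical
    integration points supplied by the organization."""
    feedback = None
    for _ in range(max_attempts):
        patch = propose_fix(finding, source, feedback)   # LLM proposal (see above)
        build = build_and_deploy(patch, env="staging")   # existing CI/CD pipeline
        if not build["succeeded"]:
            feedback = build["log"]          # feed the build failure back in
            continue
        scan = rescan(build["environment"])  # the same scanner that found the issue
        if finding["id"] not in scan["open_findings"]:
            return patch                     # clean build and clean scan
        feedback = scan["report"]            # still vulnerable: iterate
    return None  # out of attempts; route to a human engineer
```

The loop deliberately stops after a fixed number of attempts rather than iterating forever, matching the point above that a failed automated path should fall back to a person.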
The fact that code builds and passes a test suite doesn’t mean it works, but the same is true of a change written by a real engineer. Remember that this is about trust, not just cybersecurity: we need to ask the right engineer whether they approve the change. Over time, we will earn enough of their confidence to do this automatically.
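One way to wire in that approval step, sketched here against the GitHub REST API, is to open a pull request for the proposed fix and request review from the owning engineer rather than merging automatically; the owner, repo, branch, and reviewer names are hypothetical.

```python
# Minimal sketch of the trust gate: the fix ships as a pull request
# that the owning engineer must approve. Uses the GitHub REST API via
# requests; all identifiers below are hypothetical placeholders.
import os
import requests

API = "https://api.github.com"
HEADERS = {"Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
           "Accept": "application/vnd.github+json"}

def open_fix_for_review(owner, repo, fix_branch, reviewer, finding_id):
    # Open a PR from the branch that carries the proposed fix.
    pr = requests.post(
        f"{API}/repos/{owner}/{repo}/pulls", headers=HEADERS,
        json={"title": f"Proposed remediation for {finding_id}",
              "head": fix_branch, "base": "main",
              "body": "Automated fix proposal; build and rescan passed. "
                      "Please review before merge."},
    ).json()
    # Ask the owning engineer to approve; nothing merges without them.
    requests.post(
        f"{API}/repos/{owner}/{repo}/pulls/{pr['number']}/requested_reviewers",
        headers=HEADERS, json={"reviewers": [reviewer]},
    )
    return pr["html_url"]
```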
So where does that leave us with GenAI for cybersecurity? The proposed architecture may become a reality three to five years from now; most enterprises are not quite ready yet because of dependencies such as automated CI/CD pipelines and appropriate test harnesses, legal and compliance risk, and some level of standardization.
Today, the industry can benefit from GenAI by using it to accelerate the speed of remediation, to reduce the costs associated with finding and fixing issues, and to enhance the workflow between security teams and developers. The key is to build trust and confidence in AI-driven remediation and gradually implement it into organizational practices. With time, the industry will move closer to achieving the goal of autonomous cloud security remediation.