Shadow Data is Inevitable, but Security Risks Aren’t
Published 04/12/2023
Originally published by Dig Security.
Written by Benny Rofman.
Shadow data is unavoidable. It’s always been around, but the move to the cloud and the push towards data democratization have made it far more common. It’s never been easier to create shadow data assets, and employees have an incentive to do so when it can help them get the insights they’re looking for, faster. In most cases this isn’t cause for concern – but the story doesn’t end here.
The problems begin when sensitive data gets in the mix – e.g., customer health records stored in databases that security teams aren’t even aware exist. If you’re looking to harden your cloud data security posture, addressing these situations has to be on your roadmap. But how do you balance between security, agility, and data democratization?
Shadow Data and the Cloud: A Match Made in Heaven
Businesses love the public cloud because it makes everything easy to deploy. Rather than begging a centralized IT team to allocate resources for a new data initiative, smaller dev or analytics teams can easily spin up new cloud resources and start filling them with data.
Current trends such as data democratization and business agility implicitly encourage shadow data. These principles essentially boil down to smaller teams accessing data independently – bypassing traditional gatekeepers in IT, DevOps, or DBA departments. Marketing analysts might move customer data to Google BigQuery in order to analyze product usage patterns, while Support stores a copy of the same data in Snowflake as part of a ticket NLP project. These tools can be set up in a few clicks and require little pre-existing knowledge beyond SQL.
When infrastructure is easier to deploy, it’s harder to monitor. And while infrastructure security has come a long way in recent years, it has struggled to keep up with the rate at which shadow data is being created in the cloud – which can easily lead to a leak of sensitive or regulated information.
Shadow data is uniquely risky since it is an ‘unknown unknown’ – it’s not just that security teams don’t know where to find it, they don’t even know that they should be looking for it. And by definition, sensitive data being stored in a shadow data store is not subject to the organization’s standard security policies and is not being monitored.
Common Shadow Data Scenarios
Shadow data can be generated as part of testing, backup, cloud migration, or regular business operations. In many cases, this actually helps teams accomplish more, faster – and we wouldn’t necessarily want to discourage it. But when sensitive data is forgotten or abandoned, it presents an attractive opportunity for cyber-attackers looking to steal data or deploy ransomware.
Let’s talk about where to look for shadow data, based on real-life scenarios we’ve encountered in our work with customers.
Object storage (AWS S3, Google Cloud Storage, Azure Blob) – the biggest culprit
Object storage is unstructured, inexpensive, and typically broadly accessible. It also tends to be the biggest component of the organizational data estate. Even though it is the obvious suspect for hidden shadow data, building effective detection mechanisms is challenging – and the shadow data often goes unnoticed.
Real-life examples:
- A data scientist uses Databricks to run a very specific, one-off transformation in order to answer a business question; they then store the results in S3, in case they need them for some future analysis. If this is done using anonymized, non-PII data, it’s not much of an issue; but if they’ve just created a copy of customer credit card information, it poses a problem for both security and compliance.
- A company dumps its Redis instance, which contains PII, into an unencrypted S3 bucket. The security team is unaware of the problem – it’s just one more S3 bucket with an inconspicuous name. A malicious actor with access to the cloud environment would have no problem getting their hands on this data. In addition to the threat of the data being exposed, the company is at risk of dozens of compliance violations due to mishandled customer data.
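A first-pass check for this kind of exposure can be scripted. The sketch below is a minimal, hypothetical example: it flags buckets that lack a default server-side encryption configuration, given a mapping of bucket names to their encryption config. The bucket names and the boto3 snippet in the comments are illustrative assumptions, not details from the incidents above.

```python
from __future__ import annotations


def flag_unencrypted(buckets: dict[str, dict | None]) -> list[str]:
    """Return names of buckets with no default-encryption config (None)."""
    return sorted(name for name, enc in buckets.items() if enc is None)


# Populating the mapping with boto3 might look roughly like this
# (get_bucket_encryption raises ClientError when no config exists):
#
#   s3 = boto3.client("s3")
#   buckets = {}
#   for b in s3.list_buckets()["Buckets"]:
#       try:
#           cfg = s3.get_bucket_encryption(Bucket=b["Name"])
#           buckets[b["Name"]] = cfg["ServerSideEncryptionConfiguration"]
#       except botocore.exceptions.ClientError:
#           buckets[b["Name"]] = None
```

A list like this is only a starting point – an unencrypted bucket with an inconspicuous name still needs content-level inspection to tell whether it holds a harmless log archive or a PII dump.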
Unmanaged data stores – a complete black box
In a world of on-demand compute, it’s impossible for security teams to monitor every new VM. But if these machines are used to run databases, you now have a data asset whose contents are invisible to most security solutions.
Real-life examples:
- A developer is trying to solve a data quality issue. She spins up a new Postgres instance and fills it with production data to run a test. This can be a snapshot or, in the worst-case scenario, an automatically updated replica of a database containing sensitive information. Once the project is done, the database should be deleted – in practice, it is often left running, leaving the data dormant in the environment for anyone to access.
- A company has an unmanaged MariaDB instance in their cloud environment. The database contains hundreds of gigabytes of data copied from production a year earlier, including thousands of electronic health records. It is no longer running, but the data is still there – dormant and ripe for attack. Due to personnel changes, no one can really explain what this database was meant to be doing. The CSPM platform alerts the company to its existence, but it is not treated as high priority since there is no indication of the scale of sensitive data it contains.
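One lightweight signal for unmanaged databases like these is network exposure: a VM whose security group opens a well-known database port is worth a look. The sketch below is hypothetical – the simplified rule format and the port list are assumptions, and a real implementation would pull rules via a cloud API such as EC2’s describe_security_groups.

```python
from __future__ import annotations

# Common database ports worth flagging; illustrative, not exhaustive.
DB_PORTS = {3306: "MySQL/MariaDB", 5432: "Postgres", 6379: "Redis", 27017: "MongoDB"}


def exposed_db_ports(rules: list[dict]) -> list[tuple[int, str]]:
    """Flag rules that open a known database port to the whole internet.

    Each rule is simplified to {'port': int, 'cidr': str}.
    """
    return sorted(
        (r["port"], DB_PORTS[r["port"]])
        for r in rules
        if r.get("port") in DB_PORTS and r.get("cidr") == "0.0.0.0/0"
    )
```

A hit here doesn’t prove there is sensitive data behind the port – that determination still requires looking at the contents, which is the subject of the next sections.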
Duplicates on managed data stores
Partitions, snapshots, staging tables, and ELT jobs often lead to duplicate and triplicate copies of data being created in cloud data assets such as BigQuery or Snowflake. While these tools have some built-in monitoring, the sheer number of services and copies can make keeping track of them close to impossible.
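Warehouse metadata gives a cheap way to surface candidates. In Snowflake, for example, the INFORMATION_SCHEMA.TABLES view exposes ROW_COUNT and BYTES per table; grouping tables that match on both is a crude but useful duplicate heuristic. The sketch below assumes that metadata has already been fetched into a list of dicts – the field names are illustrative.

```python
from __future__ import annotations

from collections import defaultdict


def likely_duplicates(tables: list[dict]) -> list[list[str]]:
    """Group tables that match exactly on (row count, size in bytes).

    tables: [{'name': str, 'rows': int, 'bytes': int}, ...]
    Returns only groups with more than one member - candidate duplicates.
    """
    groups: dict[tuple[int, int], list[str]] = defaultdict(list)
    for t in tables:
        groups[(t["rows"], t["bytes"])].append(t["name"])
    return sorted(names for names in groups.values() if len(names) > 1)
```

Exact-match grouping misses near-duplicates (a copy with one extra column, a stale snapshot), so it flags candidates for review rather than delivering verdicts.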
The Need for Data-Centric Security
While the examples above are on the egregious side, similar situations are extremely common. We regularly find dozens of forgotten database dumps and orphaned snapshots in customer environments.
When data plays such a major part in every business process, you’re going to have more people doing more things with more data – and as the organization grows, keeping track of it all becomes near-impossible with existing posture management solutions.
If you can’t eliminate shadow data, what can you do about it? How can you ensure that it’s not creating a major security liability?
Policy and posture are a necessary-but-insufficient first step. Cloud Security Posture Management (CSPM) tools can notify you when a new data asset has been created, or when data is moved into an unencrypted S3 bucket. In a perfect world, this might be enough to mitigate security risks. But commonplace business practices, performed at scale, can generate hundreds or thousands of alerts in these solutions. Prioritizing the security risks requires taking data context into account.
Unlike CSPM, data security posture management (DSPM) is an effective way of detecting shadow data assets and identifying the ones that need to be prioritized. Where CSPM tells you that a database is running, a DSPM tool goes several steps further by classifying the records contained within, and providing configuration and content information. By scanning the actual records, DSPM can detect shadow data anywhere – whether it’s in a database or in file storage, managed or unmanaged. Highlighting and classifying sensitive records allows security teams to focus on the data assets that pose the largest risk, for either security or compliance reasons.
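Content-level classification is what separates this approach from infrastructure-only checks. As a toy illustration of the idea (not how any particular product works), the sketch below tags sampled records with simple pattern-based detectors, using a Luhn checksum to cut false positives on card-like digit runs; the patterns and labels are assumptions for the example.

```python
import re


def luhn_valid(number: str) -> bool:
    """Luhn checksum check for card-like digit runs."""
    digits = [int(d) for d in number][::-1]
    total = sum(digits[0::2]) + sum(sum(divmod(2 * d, 10)) for d in digits[1::2])
    return total % 10 == 0


# Illustrative detectors; real classifiers are far richer.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}


def classify(record: str) -> set:
    """Return the set of PII labels detected in a sampled record."""
    labels = {name for name, pat in PATTERNS.items() if pat.search(record)}
    for m in re.finditer(r"\b\d{13,19}\b", record):
        if luhn_valid(m.group()):
            labels.add("credit_card")
    return labels
```

Run over a sample of records from each discovered asset, even detectors this simple separate a bucket of anonymized test data from a copy of customer credit card information – which is exactly the prioritization signal CSPM alone lacks.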
Data detection and response (DDR) completes the picture by providing real-time monitoring of data assets, allowing security teams to quickly intervene when unwanted actions are being taken – such as a long-forgotten dataset suddenly being copied to S3, or an unusual snapshot being taken of a production database.
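Conceptually, a DDR rule is a predicate over a stream of audit events. The toy sketch below matches CloudTrail-style events against a watchlist of sensitive assets; the event fields, API names, and asset names are illustrative assumptions, not a real product’s rule format.

```python
from __future__ import annotations

# API calls that move or snapshot data; illustrative, not exhaustive.
WATCHED_EVENTS = {"CopyObject", "GetObject", "CreateDBSnapshot"}


def suspicious(event: dict, watched_assets: set[str]) -> bool:
    """Flag a watched API call that touches a watched data asset."""
    return (
        event.get("eventName") in WATCHED_EVENTS
        and event.get("resource") in watched_assets
    )
```

In practice the watchlist would be fed by the classification step – assets found to hold sensitive records get real-time monitoring, so a sudden copy of a long-dormant dataset raises an alert within minutes rather than at the next posture scan.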
By combining posture management, static risk detection, and dynamic monitoring, companies can gain visibility and control while supporting data-driven operations at scale.
Is Your Shadow Data a Problem? 3 Questions to Ask
- Do you have an automated discovery tool that can notify you of new sensitive data assets in your environments and the safeguards around them?
- Can you protect your cloud data without hindering development or infrastructure performance?
- Would you be alerted in real time to suspicious actions involving your sensitive data?