Hunting Shadow Data: A Guide for Security and Compliance Teams
Originally written by Sentra.
What is Shadow Data?
Shadow data is any organizational data that has been copied, backed up or is otherwise stored such that it is not subject to your organization’s centralized (and secured) data management framework. For example, shadow data may not be housed according to your preferred security structure. It may not be subject to your access control limitations, or visible to the tools you use to monitor and log data access.
Shadow data is the ultimate ‘known unknown’. You know it exists, but you don’t know exactly where it is. More importantly, because you don’t know how sensitive the data is, you can’t protect it in the event of a breach.
Where Does Shadow Data Come From?
Whether it’s created inadvertently or on purpose, data that becomes shadow data is simply data in the wrong place, at the wrong time.
Consider the classic example of customer data that was copied from production to a dev environment to be used as test data, and then was either never erased or was backed up to a less secure location. This data was secure in its original location and was never intended to be copied, or at least not copied and forgotten. But this type of human error is common.
Another great example of where shadow data comes from is decommissioned legacy applications. What happens to historical customer data or PII when you port to a brand new application? It’s often just left sitting where it was originally stored, waiting to be deleted - or not.
Then there is data produced by shadow IT applications and stored locally, or databases used by niche apps that are stored locally. Again, this is legitimate and live business data but it’s stored in the wrong location. Shadow IT applications also produce toxic logs, which frequently contain sensitive information yet remain unsecured owing to the off-the-books nature of shadow IT activity.
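To make the idea of toxic logs concrete, here is a minimal sketch of scanning log lines for sensitive-looking values. The two patterns, the function name scan_log_lines, and the sample log lines are all assumptions made up for illustration. Real classification needs far more robust detection than a pair of regular expressions.

```python
import re

# Illustrative patterns only (assumptions, not a complete PII classifier).
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def scan_log_lines(lines):
    """Return (line_number, label) pairs for lines matching any pattern."""
    findings = []
    for number, line in enumerate(lines, start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((number, label))
    return findings


# Made-up log lines, used only to exercise the function.
sample_log = [
    "2024-01-01 12:00:00 INFO service started",
    "2024-01-01 12:00:05 DEBUG signup payload email=jane.doe@example.com",
    "2024-01-01 12:00:09 DEBUG verify ssn=123-45-6789 ok",
]

findings = scan_log_lines(sample_log)
for number, label in findings:
    print(f"line {number}: possible {label}")
```

A scan like this only helps for logs you already know about. The harder part of the problem is that shadow IT logs are often not on anyone's list to scan.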
Finally, consider the constant flux of the average enterprise IT ecosystem. When architecture changes occur, whether in private data centers or in the cloud, lift and shift often leaves behind data leftovers - unmanaged or orphaned backups, for example – that can contain sensitive information.
Why Has the Shadow Data Problem Been Getting Worse?
In on-prem architectures, shadow data is a local problem. Someone keeps data on a server that shouldn’t be used for that purpose. But on-prem environments have many perimeter security solutions designed to prevent unauthorized access. In the cloud, however, there is no network perimeter and a simple misconfiguration can cause a large data breach. When you factor in the ability to create, duplicate and move data and the scale and speed of the cloud, data protection teams simply can’t keep up. So it’s natural we’d see this problem grow with the cloud. More reasons include:
The prevalence of multi-cloud environments:
Different cloud environments offer different benefits - but they also create more opportunities for data to be moved, copied, and forgotten. Anywhere that sees a lot of data movement is likely to generate shadow data, and the more cloud environments, the greater the chance something is missed.
Cloud-driven CI/CD speeds up interactions between development pipelines and source code repositories - making life easier for developers. However, when self-service cloud models let developers self-provision data stores on the fly, data security policies can fall victim to expediency – leading to shadow data.
As organizations embrace distributed cloud-native apps based on containers, serverless functions, and microservices, data ends up spread across hundreds of databases, shared storage, data warehouses, data pipelines and more. Data also flows to external SaaS services or warehouses, like Snowflake or Databricks. All of this movement makes it increasingly tricky to map and track data, again leading to the creation of multiple shadow data repositories.
The Business Impact of Shadow Data
Shadow data is both a security and a compliance issue, which makes it a major business concern for security, legal, and compliance teams, and for the C-suite as well.
For attackers, shadow data is the ultimate low-hanging fruit: publicly accessible sensitive data that nobody really knows is there. By our estimates here at Sentra, some 15% of all enterprise data is shadow data.
Shadow data can also present an unacceptable compliance risk. Personal, financial, and healthcare data are subject to strong compliance standards, and the law requires that shadow data containing these types of sensitive data be found and remediated.
Finally, it’s worth noting that cloud data storage is not free. At the enterprise level especially, though not only there, unknown and unnecessary cloud storage carries a literal price. Consolidating data stores and eliminating shadow data can have a measurable impact on the bottom line.
How DSPM Can Help Find, Classify, and Secure Shadow Data
The trick with shadow data, as with any security risk, lies not just in identifying it but in prioritizing the remediation of the largest risks. DSPM follows sensitive data through the cloud, helping organizations identify and automatically remediate data vulnerabilities by:
Finding shadow data where it’s not supposed to be:
DSPM is able to find all of your cloud data - not just the data stores you know about.
Finding sensitive information across different security postures:
DSPM finds sensitive data that does not appear to have an adequate security posture.
Finding duplicate data:
DSPM tools can discover when multiple copies of data exist, track and monitor them across environments, and understand which parts are both sensitive and unprotected.
Taking access into account:
Sometimes, data is actually in the right place but accessible to the wrong people. A good DSPM scrutinizes privileges across multiple copies of data, identifying who can access the data and who should not be able to.
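The four capabilities above can be sketched in a few lines of code. This is a toy illustration under stated assumptions, not how any DSPM product works: the DataStore record, its fields, the scoring weights, and the sample stores are all hypothetical. Exact hashing also only catches byte-identical copies, whereas real tools have to handle partial and transformed copies too.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class DataStore:
    """Hypothetical record describing one discovered data store."""
    name: str
    content: bytes        # stand-in for sampled content
    in_inventory: bool    # known to the central data management framework?
    sensitive: bool       # outcome of a classification step, assumed given
    encrypted: bool
    readers: frozenset    # principals that can read the store


def find_duplicates(stores):
    """Group stores whose content is byte-identical (SHA-256 match)."""
    groups = defaultdict(list)
    for store in stores:
        groups[hashlib.sha256(store.content).hexdigest()].append(store.name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


def excess_readers(store, allowed):
    """Principals that can read the store but are not on the allowed list."""
    return store.readers - allowed


def risk_score(store, allowed):
    """Arbitrary illustrative weights; tune to your own risk model."""
    score = 0
    if not store.in_inventory:
        score += 1  # data where it is not supposed to be
    if store.sensitive:
        score += 2
    if store.sensitive and not store.encrypted:
        score += 2  # sensitive data with a weak security posture
    if excess_readers(store, allowed):
        score += 1  # accessible to the wrong people
    return score


# Made-up stores, used only to exercise the functions.
stores = [
    DataStore("prod-customers", b"alice,bob", True, True, True,
              frozenset({"billing-svc"})),
    DataStore("dev-test-copy", b"alice,bob", False, True, False,
              frozenset({"billing-svc", "dev-team"})),
    DataStore("public-docs", b"release notes", True, False, False,
              frozenset({"everyone"})),
]
allowed = frozenset({"billing-svc"})

duplicates = find_duplicates(stores)
ranked = sorted(stores, key=lambda s: risk_score(s, allowed), reverse=True)

print("duplicate groups:", duplicates)
for store in ranked:
    print(store.name, risk_score(store, allowed))
```

In this toy data set the forgotten dev copy ranks first, because it is outside the inventory, sensitive, unencrypted, and readable by a team that should not have access. That ordering is the point of prioritizing remediation rather than just listing findings.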