
Shining a Light on Shadow Data: What It Is and How to Minimize It

Published 06/07/2022


This blog was originally published by Polar Security here.

Written by Roey Yaacovi, Polar Security.

Shadow data can be defined as any data that is not available or subject to an organization’s centralized data management framework. Examples of shadow data include:

  • Customer data that was copied from production to development for testing purposes
  • Data stores with sensitive data for a legacy application that isn't used anymore
  • Byproducts of an application, such as log files that contain sensitive data
  • Hard-to-find local databases used by applications
  • Data that is generated by shadow IT (i.e., apps or services that run outside the purview of central IT)
  • Siloed data only available to a specific line of business

Shadow data is, first and foremost, a problem for security and compliance teams. How can they take responsibility for the security and compliance of data stores they aren’t aware of or over which they have little or no control? This inherent tension between DevOps and security teams has been amplified by regulatory frameworks, such as GDPR and CCPA, that set and enforce high data privacy standards.

In addition, shadow data affects operations teams because unmanaged data sprawl often drives up infrastructure costs. Cloud budgets are exceeded, with little or no visibility into how the overspend was incurred or how it can be contained.

In this blog post, we’ll look at where shadow data comes from, why it’s dangerous, and how you can manage it more effectively.

Why Shadow Data Can No Longer Be Ignored

Because complex hybrid and multi-cloud environments have become the norm, shadow data can no longer be ignored. According to the Flexera 2021 State of the Cloud Report, 92% of enterprises today have a multi-cloud strategy—82% of which have embraced a hybrid approach. These environments are notoriously difficult to monitor effectively, and poor visibility is an invitation for shadow data to accumulate.

The adoption of cloud-driven CI/CD methods is also causing an increase in shadow data. Today, developers are freer to push new products and features to market than ever before. The other side of that coin, however, is that the self-service cloud model lets developers provision data stores in a few clicks—often without regard for the organization’s governance and compliance policies.

The emergence of highly distributed cloud-native apps based on microservices, containers, and serverless functions has brought the issue of shadow data even more to the forefront, as decentralized workload-based data stores are a major contributor to data sprawl.

Real-Life Examples of Shadow Data

The ironic thing about shadow data is that we all have it, but, by definition, don’t know that we have it—or where it is, or how much we have. Here are three real-life stories that illustrate how shadow data easily creeps into an organization.

  • Unmanaged public-facing databases: A developer implemented an SQLite database for storing sensitive data input by web app users. The problem? The database was deployed on a standard web server that was, by definition, public facing. In clear violation of the company’s security and compliance guardrails, sensitive user data was exposed to threat actors—and the security team wasn’t even aware of its existence.
  • Data that is generated by backend applications: Debug dumps, log files, and backup files serve DevOps engineers and typically fly under the radar of security. But they can contain sensitive shadow data. One of our design partners said that they don’t process or store any PCI data—until they discovered that one of their microservices creates a local debug dump that includes unencrypted credit card information.
  • Unmanaged cloud resources: As part of internal, and, therefore, unaudited, testing procedures, developers might create an S3 bucket in a restricted geolocation. In addition to incurring security and compliance risks, these testing resources unnecessarily add to infrastructure costs if they are not decommissioned properly.
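The debug-dump scenario above can be caught mechanically. Here is a minimal sketch, assuming plain text files stand in for dumps and logs, that flags card-number-like strings; the regex and the Luhn checksum filter are illustrative, and a real scanner would use many more detectors:

```python
import re

# Matches 13-16 digit runs, optionally separated by spaces or dashes
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def luhn_ok(number: str) -> bool:
    """Luhn checksum: filters out most random digit runs."""
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_card_numbers(text: str) -> list[str]:
    """Return normalized digit strings that look like valid card numbers."""
    hits = []
    for match in CARD_RE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if 13 <= len(digits) <= 16 and luhn_ok(digits):
            hits.append(digits)
    return hits
```

Running a check like this against log files and debug dumps in CI would have surfaced the microservice writing unencrypted card data before it reached production.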

A Shadow Data Quiz

Here are five questions you should ask yourself to assess how great a risk shadow data might be to your organization:

  • Do you have a complete and always-up-to-date inventory of your data assets across clouds, accounts, and lines of business?
  • Is your data effectively and continuously categorized according to sensitivity, utility, etc.?
  • Can you definitively map the flow of data across your cloud environment?
  • Can you be sure that all data stores are fully aligned with your organization’s security posture requirements and policies—at scale?
  • Is all valuable data available to corporate data management and data analytics systems? Have you effectively broken down shadow data silos?

If you can honestly answer “yes” to these questions, you can stop reading now. Otherwise, read on for some best practices that can mitigate shadow data, in general, and its risk, in particular.

How to Minimize Shadow Data Risks

There are a number of data management best practices that can help mitigate the risks associated with shadow data, including:

  • Adopt a shift-left approach: Data security should be an integral part of the software development lifecycle (SDLC), with the appropriate security and compliance guardrails applied from the earliest planning and design stages. Throughout the SDLC, developers should be able to answer questions such as:
      • What data will the application use or collect?
      • Where will that data be stored?
      • Which users and resources require access to the data?
      • How will the data be backed up?
  • Know your data: Scan your workloads continuously to create and maintain catalogs in which data assets are categorized by sensitivity and criticality. After all, why should you waste human and system resources protecting data stores that present no or minimal risk to your business? The data catalog should be comprehensive (cross organizational), accessible to all stakeholders, and searchable according to a wide range of parameters, such as owner, sensitivity, used by, version, and so on.
  • Follow your data: Given the scale and complexity of data in a typical mid-sized to large company, a next-generation data catalog should also provide visualizations that call out relationships, flows, and dependencies among data stores. If you can build out flows and understand who is interacting with what data, you can identify shadow data that no one is actually utilizing. This unused data is not only taking up expensive storage resources, but is also a target for cybersecurity risks, such as data exfiltration.
  • Clean your data: Keeping your data assets lean and mean is an important part of IT hygiene. Ideally, each time a developer replicates a data store for testing or an operations person mirrors a database prior to an upgrade, they should delete it when the testing environment is decommissioned or the upgrade has completed successfully. In real life, however, your environment is full of redundant, incomplete, or low-value data stores. Have processes in place to regularly get rid of this shadow data.
  • Proactively protect your data: You must include shadow data in your company’s risk-based data protection policies and processes. The risk assessment, which should be as automated as possible, will take into account the type of shadow data, its location, and, based on its sensitivity, any relevant compliance requirements. You should then apply the appropriate guardrails, including access controls, minimal privileges, checking for anomalous behavior, alerting to threats, and remediating misconfigurations.
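The "know your data" step above boils down to scanning content and assigning a sensitivity label. A minimal sketch, assuming local text files stand in for cloud data stores; the two detectors and the sensitivity labels are illustrative, not a complete classification scheme:

```python
import re
from dataclasses import dataclass, field

# Illustrative detectors; a real scanner would ship many more
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

@dataclass
class CatalogEntry:
    path: str
    findings: dict = field(default_factory=dict)  # detector name -> hit count

    @property
    def sensitivity(self) -> str:
        if "ssn" in self.findings:
            return "restricted"
        if self.findings:
            return "confidential"
        return "public"

def scan_text(path: str, text: str) -> CatalogEntry:
    """Classify one data store's content into a catalog entry."""
    entry = CatalogEntry(path)
    for name, pattern in PATTERNS.items():
        hits = pattern.findall(text)
        if hits:
            entry.findings[name] = len(hits)
    return entry
```

Entries like these, collected continuously across workloads, form the searchable catalog described above: each data store carries its location, findings, and a derived sensitivity label that downstream policies can key off.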
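The "clean your data" step can likewise be automated once a catalog exists. A minimal sketch, assuming the catalog records a last-accessed timestamp per data store; the 90-day retention window is a hypothetical policy:

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 90  # illustrative retention policy

def stale_stores(catalog: dict[str, datetime],
                 now: datetime | None = None) -> list[str]:
    """Return data stores not touched within the retention window,
    as candidates for review and decommissioning."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=RETENTION_DAYS)
    return sorted(name for name, last_used in catalog.items()
                  if last_used < cutoff)
```

Feeding the flagged stores into a review-then-delete workflow, rather than deleting outright, keeps the process safe while still shrinking the pool of forgotten test copies and mirrors.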
