Four Ways You Can Lose Your Data

Published 04/11/2023

Written by Ravi Ithal, Cofounder and Chief Technology Officer, Normalyze.

Losing your data does not mean accidentally dropping it on the ground. Everyone knows data is vital to the organization, and entities spend big on protecting the “oil” of modern business. My use of “lose” is akin to the idea of data leaking out of our direct sight and control. Hidden, unprotected repositories of potentially sensitive data are prime targets for attackers. Finding these random data hotspots should be put on every security team’s priority list ASAP.

Let’s consider four reasons why data gets lost and where teams can look to find it.

1. Microservices Bring A Blizzard Of New Services—And Data Sources

The blame game starts with microservices. In the Paleolithic Age, when data lived only on-premises and served monolithic applications with multiyear development timelines, lost data mostly came from mechanical or software failure. No one worried then about lost data falling prey to attackers because when data was lost, it was Gone with a capital G! (Hence the invention of disaster recovery.)

With cloud and agile DevOps processes, application architecture has switched to dozens or hundreds of microservices that are continuously deployed into production and combined into new internal and external apps. Data has moved into clouds and multiplied in physical and virtual corners throughout the environment. Duplication is normal and looks innocuous on the surface. For example, when a new feature is required or demand for new scale appears, the old database might not work. So, the developer migrates production data into a new datastore and fixes the issue. Perhaps the old service lingers awhile and, over time, developers forget about it and its companion database.

The database is not empty and is likely to contain sensitive data. Attackers who attain lateral access inside the cloud will have an eye out for abandoned databases. These are the least likely to be under strict access controls, so the data are ripe for picking.

Lesson No. 1: Find and remediate abandoned databases in all your environments.
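As a rough illustration of Lesson No. 1, here is a minimal sketch that flags datastores nobody has touched in a while. It assumes you already have an inventory with a last-access timestamp per datastore; the `Datastore` record, its field names and the 90-day threshold are hypothetical choices for this example, not features of any particular cloud provider or product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Datastore:
    # Hypothetical inventory record; field names are assumptions for this sketch.
    name: str
    last_accessed: datetime      # most recent connection or query observed
    holds_sensitive_data: bool   # result of whatever classification you run


def find_abandoned(stores, now, idle_days=90):
    """Return datastores idle for more than idle_days, sensitive ones first."""
    cutoff = now - timedelta(days=idle_days)
    idle = [s for s in stores if s.last_accessed < cutoff]
    # Sensitive stores sort first; within each group, the longest-idle comes first.
    return sorted(idle, key=lambda s: (not s.holds_sensitive_data, s.last_accessed))


now = datetime(2023, 4, 11, tzinfo=timezone.utc)
inventory = [
    Datastore("orders-v2", now - timedelta(days=2), True),
    Datastore("orders-v1", now - timedelta(days=400), True),   # the forgotten one
    Datastore("scratch", now - timedelta(days=120), False),
]
print([s.name for s in find_abandoned(inventory, now)])
```

The idle threshold is a judgment call. A quarterly reporting database may sit untouched for weeks and still be legitimate, so treat the output as a review list rather than a deletion list.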

2. AI/ML Modeling Fuels The Risky Use Of More Data Stores

AI/ML is all about data: the more, the better. Data bulk is important because it allows the models to learn better and faster. Learning accuracy is enabled by good production data, which tends to include sensitive information that needs protection. Our interest here is in a scenario that specifically affects small- to mid-sized companies that are less mature with AI/ML model management. Security teams usually are good at protecting production data in the cloud. But when a new AI/ML business case arises, data scientists need to move data from production into the model development environment to test hypotheses and enable model learning.

Without security controls managed by an MLOps platform or other means, placing production data into a model development environment can lead to data insecurity. This is especially true when team members make a copy of the data to run their own tests. Typically, major parts of the database are duplicated because it’s easier to start from scratch than to take the original model and append new data. The result can be significant data duplication. If these data reside in non-protected databases, they become prime targets for attackers.

Lesson No. 2: Find and remediate old, unused data in AI/ML model development environments and install controls to monitor who can access production data even if they appear in non-production environments.
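One way to spot production data that has wandered into a model development environment is to compare content fingerprints. The sketch below is illustrative only: it assumes tables small enough to hash in memory, the table names are made up, and it detects exact copies only. A partial or transformed copy would need sampling or column-level matching instead.

```python
import hashlib


def fingerprint(rows):
    """Order-independent SHA-256 digest of a table's rows."""
    digest = hashlib.sha256()
    # Sort the row representations so row order does not change the result.
    for row in sorted(repr(tuple(r)) for r in rows):
        digest.update(row.encode("utf-8"))
        digest.update(b"\n")  # separator so adjacent rows cannot run together
    return digest.hexdigest()


def find_copies(production, other_envs):
    """Return (non-production table, production table) pairs with identical content."""
    prod_prints = {fingerprint(rows): name for name, rows in production.items()}
    copies = []
    for table, rows in sorted(other_envs.items()):
        source = prod_prints.get(fingerprint(rows))
        if source is not None:
            copies.append((table, source))
    return copies


# Hypothetical tables for illustration.
production = {"customers": [(1, "alice"), (2, "bob")]}
ml_dev = {
    "ml_dev.customers_copy": [(2, "bob"), (1, "alice")],  # same rows, reordered
    "ml_dev.features": [(0.1,), (0.7,)],
}
print(find_copies(production, ml_dev))
```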

3. CI/CD Accidentally Creates Shadow Data Stores

A shadow datastore is created by developers for use with the DevOps process, but it is not sanctioned by or even on the radar of security operations. Developers might create one for experimental features or for features that went into production without a proper review. Typically, shadow datastores are not operated under standard access controls, nor is the data encrypted. Such behavior accompanies lax CI/CD processes that do not build in security best practices. When shadow datastores contain sensitive information (and many times they do!), they create a major vulnerability that is attractive to attackers who seek the path of least resistance to a breach.

Lesson No. 3: Find shadow datastores and implement standard security processes for data access and protection.
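At its simplest, finding shadow datastores is a set difference: everything discovered in the environment minus everything security operations has sanctioned. The following sketch assumes you can export both lists. The dictionary shape, the `encrypted` flag and the datastore names are hypothetical.

```python
def find_shadow_datastores(discovered, sanctioned):
    """Return (name, encrypted) for every discovered datastore not in the sanctioned set.

    discovered: dict mapping datastore name -> properties, e.g. {"encrypted": bool}
    sanctioned: set of datastore names approved by security operations
    """
    findings = []
    for name, props in sorted(discovered.items()):
        if name not in sanctioned:
            # Treat a missing flag as unencrypted so gaps surface as findings.
            findings.append((name, bool(props.get("encrypted", False))))
    return findings


# Hypothetical discovery output and approved inventory.
discovered = {
    "billing-db": {"encrypted": True},
    "tmp-feature-x": {"encrypted": False},
    "exp-cache": {"encrypted": True},
}
sanctioned = {"billing-db"}

for name, encrypted in find_shadow_datastores(discovered, sanctioned):
    print(name, "encrypted" if encrypted else "NOT encrypted")
```

The hard part in practice is the discovery side, since a shadow datastore is by definition something nobody registered. The comparison itself is the easy step.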

4. Placing A Key Under The Doormat And Other Innocent Ways Of Exposing Data

Laziness by engineers can trigger exposure of multiple datastores. Consider the power of access credentials used by engineers. These users often log onto two or more machines in production, which in turn have access to other resources. Entering credentials to access an additional system interrupts the workflow, and some engineers (being human) may be inclined to avoid that interruption. Their approach is to store the credentials on the machine so they do not have to reenter them. And once machines in the production environment hold keys and secrets, voila: data is potentially exposed to attackers, insiders or otherwise.

Lesson No. 4: Organizations must scan for potentially exposed data or access to it in unusual places. You never know where it might reside!
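To make Lesson No. 4 concrete, here is a minimal sketch of a scanner that looks for stored credentials in files. The three patterns are examples, not a complete rule set: the `AKIA` prefix followed by 16 characters is the well-known shape of an AWS access key ID, the second matches PEM private key headers, and the third is a loose guess at password assignments that will produce false positives. The function names are invented for this example.

```python
import re
from pathlib import Path

# Example patterns only; a real scanner would carry many more rules.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key": re.compile(r"-----BEGIN (?:[A-Z]+ )?PRIVATE KEY-----"),
    "password_assignment": re.compile(r"(?i)\bpassword\s*[=:]\s*\S+"),
}


def scan_text(text):
    """Return (line number, pattern name) for each suspicious line."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for kind, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((line_no, kind))
    return hits


def scan_tree(root):
    """Scan every readable file under root; return {path: hits} for files with hits."""
    findings = {}
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        try:
            text = path.read_text(errors="ignore")
        except OSError:
            continue  # unreadable file: skip rather than abort the scan
        hits = scan_text(text)
        if hits:
            findings[str(path)] = hits
    return findings


# AKIAIOSFODNN7EXAMPLE is the placeholder key ID used in AWS documentation.
sample = "user=admin\npassword = hunter2\nkey AKIAIOSFODNN7EXAMPLE\n"
print(scan_text(sample))
```

Calling `scan_tree` on a home directory or a deployment folder would apply the same checks to each file in it. The scanner reports only line numbers and pattern names, so the findings do not themselves become another copy of the secrets.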

Clearly, there are many ways to lose data and expose it to attackers. The scenarios described above are just conversation starters. A comprehensive risk management strategy must include processes for universally scanning the entire enterprise environment for lost or forgotten data. Until you can find all your data and know if sensitive information is exposed, the risk of a successful breach will always be with your organization.
