
Building a Better DSPM by Combining Data Classification Techniques

Published 02/08/2023

Originally published by Sentra.

The increasing prevalence of data breaches is driving many organizations to add another tool to their ever-growing security arsenal - data security posture management, or DSPM.

This new approach recognizes that not all data is equal - a breach of some data can have dire implications for an organization, while a breach of other data may be alarming but will not cause major financial or reputational damage.

To fulfill this approach's potential, organizations need to map all data in their cloud environments, determine where sensitive data is stored, and establish who has access to it. Because some data is more sensitive than other data, accurate classification is the core of a successful DSPM solution.

Unfortunately, there’s no single approach that can cover all data classes optimally, so we need to employ a number of classification techniques, scanning methods, verification processes, and advanced statistical analysis. By combining techniques so that the strengths of one offset the weaknesses of another, we can reach a high level of accuracy for all types of sensitive cloud data.

Let’s dive into some of these techniques to see how different methods can be used in different situations to achieve superior results.

The Power and Limits of Regular Expressions

Regular expressions are a very robust tool that can precisely capture a wide range of data entities at scale. Regular expressions capture a pattern - specific character types, their order, and lengths. Using regular expressions for classification involves looking at a string of characters - without any context - and deducing what entity it represents based only on the pattern of the string. A couple of examples where this can be used effectively are AWS keys and IP addresses. We know how many characters and what type of characters these entities contain.
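As a rough illustration of pattern-only classification, here is a minimal Python sketch for the two entities mentioned above. The exact patterns are assumptions made for this example: it treats an AWS access key ID as 20 characters starting with `AKIA` or `ASIA`, and it uses Python's standard `ipaddress` module to discard IPv4 candidates with out-of-range octets. A production classifier would likely need broader and more carefully tuned patterns.

```python
import ipaddress
import re

# Assumed shape of an AWS access key ID: a 4-letter prefix (AKIA or ASIA)
# followed by 16 uppercase letters or digits, 20 characters in total.
AWS_ACCESS_KEY_ID = re.compile(r"\b(?:AKIA|ASIA)[A-Z0-9]{16}\b")

# Four dot-separated groups of 1-3 digits. The regex alone would accept
# "999.1.1.1", so candidates are validated in a second step.
IPV4_CANDIDATE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_aws_key_ids(text):
    """Return every substring that matches the assumed key ID pattern."""
    return AWS_ACCESS_KEY_ID.findall(text)


def find_ipv4_addresses(text):
    """Return regex candidates that also parse as valid IPv4 addresses."""
    hits = []
    for candidate in IPV4_CANDIDATE.findall(text):
        try:
            ipaddress.IPv4Address(candidate)
        except ValueError:
            continue  # pattern matched, but an octet is out of range
        hits.append(candidate)
    return hits
```

Neither function looks at anything around the match. That is what makes this approach fast at scale, and also what limits it.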

However, the limitation of this approach is that if the pattern of the characters isn’t sufficient to classify the entity, a regular expression will need ‘help’. For example, a nine-digit number can represent a number of things: if it is on a driver’s license, it’s probably a license number; if it’s on a tax return, it’s probably a Social Security Number; and so on.

Humans do this subconsciously all the time. If you hear someone’s name is ‘George’ you know that’s a common first name, and you will assume - usually correctly - that the individual’s first name is ‘George’ and not his last name.

So what we need is a classification engine that can make these connections the way humans do - one that can look at the context of the string, and not just its content. That’s one of the ways you can improve classification accuracy with a DSPM. You could also give the engine a list of names and tell it “these are first names” so that it’s able to accurately make these connections.
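A simple way to picture context-aware classification is to look at the words surrounding a match. The sketch below is only an illustration under stated assumptions: the keyword lists, the 40-character window, the label names, and the tiny first-name list are all hypothetical, and a real engine would use far richer signals.

```python
import re

NINE_DIGITS = re.compile(r"\b\d{9}\b")

# Hypothetical context keywords for each candidate label.
CONTEXT_KEYWORDS = {
    "social_security_number": ("ssn", "social security", "tax return"),
    "drivers_license_number": ("driver", "license", "licence", "dmv"),
}

# Hypothetical reference list that tells the engine "these are first names".
KNOWN_FIRST_NAMES = {"george", "maria", "john"}


def classify_nine_digit_numbers(text, window=40):
    """Label each nine-digit number using keywords found near it."""
    results = []
    for match in NINE_DIGITS.finditer(text):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        context = text[start:end].lower()
        label = "unknown"  # the pattern alone is not enough to decide
        for candidate, keywords in CONTEXT_KEYWORDS.items():
            if any(keyword in context for keyword in keywords):
                label = candidate
                break
        results.append((match.group(), label))
    return results


def is_known_first_name(token):
    """Dictionary lookup against the reference list of first names."""
    return token.lower() in KNOWN_FIRST_NAMES
```

The same nine digits receive different labels depending on the surrounding text, and stay `unknown` when no context is available.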

Another method to provide context is NER - Named Entity Recognition. This is a tool from Natural Language Processing (NLP) which can analyze sentences to determine the category of different words. Compensating for the limitations of regular expressions with these techniques is one way to ensure that you’re always using the best possible classification technique.

Of course, we still need to ensure that these patterns or data entities are actually the ones we’re looking for. For example, let’s say we identify a 16-digit number. This could be a credit card number. But it could also be a user ID, a bank account number, a tracking number, or just a very large number.

So how do we determine if this is, in fact, a credit card number?

There are a number of ways we can confirm this. The approaches below use the credit card example, but they extend to other data classes:

  • Verify the integrity of the results: Credit cards have a check digit, the last digit in any card number, designed to catch typos (it is computed with the Luhn algorithm). We can verify it is correct. We can also verify the first few digits are in the ranges allowed for credit cards.
  • Model internal structure of the data: If data is in tabular form, such as a .csv file, we can create models of relationships between column values, so that the whole column is labeled as credit card numbers only if, for example, at least 50% of its values are valid credit card numbers.
  • Look at the data’s ‘detection context’: If data is in tabular form, such as a .csv file, we can increase our certainty of a credit card detection if the column is named “credit card number”. The relationships between different columns can be used to add missing context, so a column suspected to hold credit card numbers will seem much more probable if there’s an expiration date column and a CVV column in the same table. When the data is in free-form text (as in a .docx file) this is much more complicated, and tools such as natural language understanding and keywords must be applied to accurately classify the data.
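The three checks above can be sketched together in Python. This is a minimal illustration, not a description of any particular product: the function names, the header hints, the 50% threshold default, and the small subset of card prefixes (Visa, Mastercard, American Express) are assumptions chosen for the example.

```python
import re

# Hypothetical header hints, compared against lowercased column names.
CARD_HEADER_HINTS = ("credit card", "card number", "card no")
COMPANION_HINTS = ("expir", "cvv", "cvc")


def luhn_valid(digits):
    """Luhn check: double every second digit from the right, sum, mod 10."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def has_known_prefix(digits):
    """Illustrative subset of issuer prefixes, not a complete list."""
    return (
        digits.startswith("4")                  # Visa
        or digits[:2] in {"34", "37"}           # American Express
        or 51 <= int(digits[:2]) <= 55          # Mastercard
        or 2221 <= int(digits[:4]) <= 2720      # Mastercard (2-series)
    )


def looks_like_card_number(value):
    """Integrity checks on a single value: length, prefix, check digit."""
    digits = re.sub(r"[ -]", "", value.strip())
    if not (digits.isascii() and digits.isdigit() and 13 <= len(digits) <= 19):
        return False
    return has_known_prefix(digits) and luhn_valid(digits)


def score_column(header, values, other_headers, min_valid_fraction=0.5):
    """Label a whole column and record the reasons for the decision."""
    non_empty = [v for v in values if v.strip()]
    valid = sum(1 for v in non_empty if looks_like_card_number(v))
    fraction = valid / len(non_empty) if non_empty else 0.0
    if fraction < min_valid_fraction:
        return {"label": None,
                "reasons": [f"only {fraction:.0%} of values pass the checks"]}
    reasons = [f"{fraction:.0%} of values pass the checks"]
    name = re.sub(r"[_-]", " ", header.lower())
    if any(hint in name for hint in CARD_HEADER_HINTS):
        reasons.append(f"column name '{header}' suggests a card number")
    companions = [h for h in other_headers
                  if any(hint in h.lower() for hint in COMPANION_HINTS)]
    if companions:
        reasons.append("related columns present: " + ", ".join(companions))
    return {"label": "credit_card_number", "reasons": reasons}
```

For example, `score_column("card_number", ["4111111111111111", "5555555555554444", "n/a"], ["name", "expiry_date", "cvv"])` labels the column as `credit_card_number` and returns three reasons, while a `user_id` column of arbitrary 16-digit numbers is left unlabeled. Returning the reasons alongside the label is what lets a reviewer see why a decision was made.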

These are a few examples of methods that, when combined appropriately, can yield results that are not only much more accurate, but also much more useful for explaining and understanding the reasoning behind these decisions.

Data classification has long been a challenge because of the limitations of individual models. Only by using different methods in conjunction are we able to classify with the level of accuracy required to assist data and security teams responsible for securing large quantities of cloud data.
