Securing Data Lakes in a Data Centric World
Written by Dr. Diane Murphy, Marymount University and Oliver Forbes, NTT DATA
Data allows today's businesses to optimize performance, investigate fraud, and discover answers to questions we did not even know to ask. At the heart of these capabilities are expansive data lakes. For clarity, a data lake is a repository created by ingesting heterogeneous datasets from various sources (local and/or external) and storing the data in its native (raw) format using a mostly flat architecture. The data lake facilitates on-demand cross-analysis of data across datasets without the need for extensive pre-processing, and new datasets can readily be added on the fly. User requirements are not completely defined when the data lake is designed and implemented, recognizing the changes in technology, data, and business needs that happen over time. Most greenfield data lake solutions are being built in the cloud.
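To make the ingestion idea concrete, here is a minimal sketch of landing a dataset in a flat "raw zone" in its native format, with a small metadata record written alongside it. The `lake/raw` path, the `ingest` function, and the metadata fields are all hypothetical, chosen only for illustration:

```python
import json
import shutil
import uuid
from datetime import datetime, timezone
from pathlib import Path

RAW_ZONE = Path("lake/raw")  # hypothetical flat storage zone

def ingest(source_file: str, source_system: str) -> Path:
    """Copy a dataset into the raw zone unchanged (native format),
    recording minimal metadata next to it for later discovery."""
    RAW_ZONE.mkdir(parents=True, exist_ok=True)
    src = Path(source_file)
    dataset_id = uuid.uuid4().hex
    dest = RAW_ZONE / f"{dataset_id}_{src.name}"
    shutil.copy2(src, dest)  # no transformation: structure is applied later, on read
    meta = {
        "dataset_id": dataset_id,
        "source_system": source_system,
        "original_name": src.name,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    meta_path = dest.with_name(dest.name + ".meta.json")
    meta_path.write_text(json.dumps(meta, indent=2))
    return dest
```

Because nothing is transformed at ingest time, a new dataset can be added in minutes; the cost of interpreting it is deferred until someone actually queries it.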
Datasets can come in many forms: raw data, streaming data, application data (including data held in a database or a data warehouse), textual data, or archived data. Data lakes are designed to handle the large volumes of data available today, the variety of data encountered (structured, semi-structured, and unstructured), and the high velocity at which today's data arrives and changes. At the same time, they must effectively and efficiently meet the advanced and ad-hoc query needs of business intelligence, machine learning, and artificial intelligence approaches, where the nature of the data to be searched and analyzed is not readily prescribed.
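The "variety" point is often summarized as schema-on-read: structure is applied only when the data is queried, based on its native format. A minimal sketch, using hypothetical file names and treating the extension as the format hint:

```python
import csv
import json
from io import StringIO
from pathlib import Path

def read_on_demand(path: str):
    """Schema-on-read: parse a raw file according to its native format
    only at query time, leaving the stored copy untouched."""
    p = Path(path)
    text = p.read_text()
    if p.suffix == ".json":
        # semi-structured: nested keys and values
        return json.loads(text)
    if p.suffix == ".csv":
        # structured: rows become dictionaries keyed by the header
        return list(csv.DictReader(StringIO(text)))
    # unstructured: hand back the raw text for downstream analysis
    return text
```

A real lake would use far richer format detection and a catalog rather than file extensions, but the principle is the same: the same raw bytes can serve BI, ML, and ad-hoc analysis without a predefined schema.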
Data Lakes and the Cloud
Cloud technology has become a significant component of many data lakes. One primary consideration is the elasticity of both the storage and computing resources. Elasticity enables the data lake to grow as datasets become larger and more numerous, and allows query performance to be maintained as queries grow ever larger and more complex. The cloud also allows access from anywhere in the world, and many cloud providers offer a variety of technology tools to implement a data lake.
Data Lakes vs. Data Warehouses
The main differences between a data warehouse and a data lake reflect the unknowns in the environment today. A data warehouse focuses on the data and requirements as known, and only predefined requirements can be answered. The data is integrated, highly structured, and decision-oriented. As data is ingested, it uses extensive pre-processing during the extraction, transformation, and load (ETL) process. A data lake, by contrast, loads data in its raw form and defers transformation until query time (sometimes described as ELT, or schema-on-read), so questions that were unknown at design time can still be asked of data ingested long before.
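The contrast can be sketched in a few lines. This is an illustrative toy, not either system's actual pipeline; the function names and the `id`/`amt` fields are invented for the example:

```python
# Warehouse-style ETL: transform before load, schema fixed up front.
def etl_to_warehouse(rows, load):
    for row in rows:
        cleaned = {"order_id": int(row["id"]), "amount_usd": float(row["amt"])}
        load(cleaned)  # only conformed, pre-modeled records enter the warehouse

# Lake-style ELT: load the raw record first, shape it only when queried.
def elt_to_lake(rows, store):
    for row in rows:
        store(row)  # native format, no upfront schema

def query_lake(raw_rows):
    # structure is applied at read time; a new question needs no re-ingestion
    return [float(r["amt"]) for r in raw_rows if "amt" in r]
```

The warehouse answers its predefined questions quickly; the lake pays a parsing cost per query but never has to guess the questions in advance.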
Data Lake Considerations
A data lake must be well designed and maintained; otherwise, it quickly becomes inoperable (a data swamp or toxic data lake). In such cases, data is difficult to understand, becomes costly to maintain, and the business value shrinks or does not materialize. In addition, the data lake must deliver good response times for a large range of queries and be capable of handling new datasets with differing characteristics. To accomplish this, good data governance is essential.
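One concrete governance practice that keeps a lake from becoming a swamp is a catalog: no dataset is discoverable until it is registered with an owner, a description, and tags. The classes and rules below are a minimal sketch of the idea, not any particular catalog product:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class CatalogEntry:
    name: str
    owner: str
    source: str
    description: str
    tags: list = field(default_factory=list)
    registered_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

class DataCatalog:
    """Minimal governance catalog: datasets must be registered, with an
    accountable owner, before they can be discovered or queried."""
    def __init__(self):
        self._entries = {}

    def register(self, entry: CatalogEntry):
        if not entry.owner:
            # a simple governance rule enforced in code
            raise ValueError("every dataset needs an accountable owner")
        self._entries[entry.name] = entry

    def find(self, tag: str):
        """Discover datasets by tag instead of browsing raw storage."""
        return [e.name for e in self._entries.values() if tag in e.tags]
```

Production catalogs add lineage, schema versions, and quality metrics, but even this much (owner, description, tags) is what separates a searchable lake from an unsearchable swamp.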
Data lakes are critical to the digital enterprise, and there are severe repercussions if the data is not correctly secured. Data lakes may suffer from data breaches, data drift, data poisoning, malware obfuscation, cyber-attacks, and violations of data privacy regulations.
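A first line of defense against breaches and privacy violations is deny-by-default access control tied to the sensitivity tags on each dataset. The sketch below assumes a hypothetical role-to-tags policy; real deployments would use their platform's IAM or lake-permissions service rather than hand-rolled checks:

```python
def authorize(user_roles, dataset_tags, policy):
    """Deny-by-default check: a user may read a dataset only if their
    roles collectively cover every sensitivity tag on the dataset."""
    granted = set()
    for role in user_roles:
        granted |= policy.get(role, set())  # unknown roles grant nothing
    return set(dataset_tags) <= granted
```

Tying authorization to tags (e.g. `pii`, `financial`) rather than to storage paths means the policy survives datasets being moved, copied, or re-partitioned inside the lake.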
In summary, today's digital business is built on the judicious use of data. Data lakes, however, must be closely governed and at the forefront of security interests. Our task as security advocates is to meet this challenge head-on, leading a visionary-focused front on how security should truly take shape, molding the security compliance, policy, and data management of a "Data Centric World." With the CSA Artificial Intelligence Working Group, we'll start our project by building out a threat model for data lakes. Join the working group today to learn more about the project and help us secure Data Lakes!