Big Data, Big Cloud, Big Problem
Published 08/15/2012
By Todd Thiemann
Big data presents a big opportunity for businesses to mine large volumes of data from a variety of sources in order to make better, faster decisions. Because big data implementations are almost always deployed in a cloud environment, be it a private or public cloud, they pose a major security challenge. That’s because some of that “Big Data” will inevitably be sensitive: intellectual property covered by corporate security mandates, cardholder data subject to PCI DSS, or Personally Identifiable Information (PII) covered by state or national data breach laws.
For the purposes of this article, our definition of Big Data refers to non-relational storage and processing technologies, including NoSQL tools such as Hadoop, MongoDB, Cassandra and CouchDB. These offerings comprise the bulk of “Big Data” deployments and share similar security challenges. For example, the Hadoop Distributed File System (HDFS) stores the data that needs to be analyzed, while a framework such as MapReduce processes it in parallel on large clusters of commodity nodes. Map tasks are distributed across the cluster and run fully in parallel; the framework then sorts their output, which becomes the input to the reduce tasks. Typically both the input and the output of the job are stored across the cluster of compute nodes.
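To make that map/sort/reduce flow concrete, here is a minimal sketch of the canonical word-count job written against the Hadoop MapReduce Java API of this era. The input and output paths are illustrative, and the job wiring is a teaching example rather than a production configuration.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs in parallel across the cluster, emitting (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: the framework sorts map output by key before this runs.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Both the input and the output live in HDFS, spread across the cluster.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```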
The ability to perform complex ad-hoc queries against massive, disparate datasets can unlock tremendous value for enterprises. To tap this intelligence, companies are turning to distributed file systems such as HDFS, primarily because the volume of data has grown beyond the performance capabilities of relational database systems.
Traditional relational databases rely on the concept of a data container; that concept is absent in the Big Data world. Instead of a data file associated with a database, NoSQL implementations scatter files across hundreds or thousands of nodes. As a result, sensitive data that requires protection no longer sits in one compact tablespace on a single system, but can be spread among a multitude of nodes in the cloud.
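You can observe this scattering directly. The sketch below uses the standard Hadoop FileSystem API to list which cluster nodes hold the blocks of a given file; the file path passed in is a hypothetical example, and the program assumes the cluster's configuration files are on the classpath.

```java
import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
  public static void main(String[] args) throws Exception {
    // Assumes core-site.xml/hdfs-site.xml on the classpath point at the cluster.
    FileSystem fs = FileSystem.get(new Configuration());
    Path file = new Path(args[0]); // e.g. /data/customers.csv (hypothetical path)
    FileStatus status = fs.getFileStatus(file);

    // Each HDFS block of the file may be replicated on a different set of nodes.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation block : blocks) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          block.getOffset(), block.getLength(),
          Arrays.toString(block.getHosts()));
    }
    fs.close();
  }
}
```

Run against even a modestly sized file, the output makes the point: the "container" is the whole cluster, and every listed host is a machine where that sensitive data physically resides.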
One of the key challenges posed by NoSQL tools is that, while they are great at crunching massive volumes of data, they have virtually no built-in security or access control capabilities. If a Big Data deployment includes, or will include, sensitive data, it is imperative to put data security and access controls in place. Operating a Big Data infrastructure without some form of security is a high-risk endeavor.
The following threats, and how to mitigate them, are important considerations in Big Data environments:
- Privileged User Abuse – keeping system administrators from accessing or copying sensitive data.
- Unauthorized Applications – preventing rogue application processes from touching your Big Data.
- Managing Administrative Access – while system administrators should not be allowed to read data, they may need access to the directory structure for maintenance operations and backups.
- Monitoring Access – understanding who is accessing what data in a Big Data repository enables the necessary auditing and reporting (see the sketch after this list).
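As a concrete illustration of access monitoring, the sketch below scans a NameNode-style HDFS audit log, tallies accesses per user, and flags reads of a path presumed to hold sensitive data. The line format in the regular expression and the /data/pii/ prefix are assumptions for illustration; real audit formats vary by Hadoop version and distribution, so the pattern would need adjusting to match yours.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AuditLogSummary {
  // Assumed (simplified) audit line fields: ugi=, cmd=, src=.
  private static final Pattern LINE = Pattern.compile(
      "ugi=(\\S+).*?cmd=(\\S+).*?src=(\\S+)");

  public static void main(String[] args) throws Exception {
    Map<String, Integer> accessesByUser = new HashMap<String, Integer>();
    BufferedReader reader = new BufferedReader(new FileReader(args[0]));
    String line;
    while ((line = reader.readLine()) != null) {
      Matcher m = LINE.matcher(line);
      if (!m.find()) continue;
      String user = m.group(1);
      String cmd = m.group(2);
      String src = m.group(3);
      // Flag reads against paths known to hold sensitive data (hypothetical prefix).
      if (cmd.equals("open") && src.startsWith("/data/pii/")) {
        System.out.printf("ALERT: %s read sensitive path %s%n", user, src);
      }
      Integer count = accessesByUser.get(user);
      accessesByUser.put(user, count == null ? 1 : count + 1);
    }
    reader.close();
    System.out.println("Access counts by user: " + accessesByUser);
  }
}
```

In practice this kind of analysis feeds a reporting or SIEM pipeline rather than a console, but the principle is the same: without a record of who touched what, neither auditing nor breach investigation is possible.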
When it comes to protecting and controlling access to Big Data, encryption combined with key management is a central element of a layered security approach. Here are some important considerations when securing Big Data environments:
- Classify data & threats – This is one of the biggest challenges for any data security project: knowing what is sensitive, where it is located, and what the potential threats are. If no sensitive data is in scope, data protection may not be necessary; if sensitive data is stored in the Big Data environment, it needs to be protected. Talking to the Big Data development team about the nature of the data is a good first step.
- Encryption & Key Management – Taping the key to the front door just above the door knob is not a security best practice. In the same vein, storing encryption keys within the data environment they are protecting is not one either; keys belong on a system separate from the data they protect (see the sketch after this list).
- Separation of Duties – This has many implications, but one is that encryption keys should never be under the control of the IT administrators who manage the data.
- Costs – Consolidating silos of encryption and key management typically reduces scalability and audit headaches and lowers total cost of ownership.
- Performance – Enterprises are embracing Big Data for its potential to enable faster decision making. By the same token, encryption and key management should not significantly slow down Big Data system performance.
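As a minimal illustration of the key management and separation-of-duties points, the sketch below encrypts a sensitive field with a key obtained from outside the data environment, using the standard javax.crypto API. The fetchKeyFromKeyManager method is a hypothetical stand-in for an authenticated call to an external key manager; only the ciphertext and IV would ever land in the Big Data store.

```java
import java.security.SecureRandom;
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.IvParameterSpec;

public class FieldEncryptor {

  // Hypothetical stand-in for an external key manager call; in practice the
  // key is fetched over an authenticated channel and never written to HDFS.
  private static SecretKey fetchKeyFromKeyManager() throws Exception {
    KeyGenerator keyGen = KeyGenerator.getInstance("AES");
    keyGen.init(256); // 256-bit AES may require the JCE unlimited-strength policy files
    return keyGen.generateKey();
  }

  public static void main(String[] args) throws Exception {
    SecretKey key = fetchKeyFromKeyManager();

    // Fresh random IV per record; stored alongside the ciphertext (IVs are not secret).
    byte[] iv = new byte[16];
    new SecureRandom().nextBytes(iv);

    Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
    cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
    byte[] ciphertext = cipher.doFinal("4111-1111-1111-1111".getBytes("UTF-8"));

    // Only the ciphertext and IV go into the Big Data store; the key stays
    // in the key manager, preserving separation of duties.
    System.out.println("ciphertext bytes: " + ciphertext.length);

    cipher.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));
    System.out.println("decrypted: " + new String(cipher.doFinal(ciphertext), "UTF-8"));
  }
}
```

Because a cluster node holding such a record has neither the key nor any path to it, a copied disk or an over-curious administrator sees only ciphertext, which is precisely the property the bullets above call for.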
Big Data promises to be the proverbial goose that lays golden eggs. Understanding the data security and privacy risks associated with a Big Data environment early in the development process, and taking appropriate steps to protect sensitive information, will prevent that goose from getting cooked.
Todd Thiemann is senior director of product marketing at Vormetric and co-chair of the Cloud Security Alliance (CSA) Solution Provider Advisory Council.