What Is Data Deduplication and Who Cares?

Published 01/19/2016

By Rachel Holdgrafer, Business Content Strategist, Code42

Data deduplication is a critical component of managing the size (and cost) of a continuously growing data store, and you will hear about it when you research endpoint backup. Intelligent compression, or “single-instance storage,” eliminates redundant data by storing one copy of a file and pointing subsequent instances of the file back to the saved copy.

There is some misunderstanding of deduplication in the marketplace even among analysts, in part because vocal endpoint backup vendors have positioned deduplication capabilities around the concept of upload speed and cost of storage rather than security and speed to recovery.

What is data deduplication?

Data deduplication is a process by which an enterprise eliminates redundant data within a data set and stores only one instance of each unique piece of data. Data deduplication can be completed at the file level or at the data block level and can occur on either the endpoint device or the server. Each of these variables plays a role in how deduplication works and its overall efficiency, but the biggest question for most folks is, “Does data deduplication matter? Is it a differentiator that I should care about?”
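To make the file-level versus block-level distinction concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular backup product works: the function names, the SHA-256 content hashes and the tiny 4-byte fixed block size are all hypothetical choices made for readability.

```python
import hashlib

BLOCK_SIZE = 4  # assumption: tiny fixed-size blocks, for illustration only


def file_level_dedup(files):
    """Keep one copy per unique file; map each file name to its content hash."""
    store = {}       # content hash -> file bytes
    references = {}  # file name -> content hash
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        store.setdefault(digest, data)  # stored only the first time it is seen
        references[name] = digest
    return store, references


def block_level_dedup(files, block_size=BLOCK_SIZE):
    """Keep one copy per unique block; map each file name to a list of block hashes."""
    store = {}       # block hash -> block bytes
    references = {}  # file name -> ordered list of block hashes
    for name, data in files.items():
        digests = []
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            digest = hashlib.sha256(block).hexdigest()
            store.setdefault(digest, block)
            digests.append(digest)
        references[name] = digests
    return store, references


def stored_bytes(store):
    """Total bytes actually kept after deduplication."""
    return sum(len(chunk) for chunk in store.values())


files = {
    "report_v1.txt": b"AAAABBBBCCCC",
    "report_v2.txt": b"AAAABBBBDDDD",    # shares two blocks with v1
    "report_copy.txt": b"AAAABBBBCCCC",  # exact duplicate of v1
}

file_store, file_refs = file_level_dedup(files)
block_store, block_refs = block_level_dedup(files)

print("raw bytes:        ", sum(len(d) for d in files.values()))
print("file-level bytes: ", stored_bytes(file_store))
print("block-level bytes:", stored_bytes(block_store))
```

In this toy data set, file-level deduplication only catches the exact copy, while block-level deduplication also catches the blocks that the two report versions have in common.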

If you are considering a robust and scalable enterprise endpoint backup solution, you can count on the fact that the software uses some sort of data deduplication process. Some solutions use global deduplication, others local deduplication and some use a combination of the two.

Local deduplication happens on the endpoint before data is sent to the server. Duplicate data is stripped out on the endpoint, and the deduplicated data is then stored on the server in a separate data set for each user archive. Each data set is encrypted with a unique encryption key.
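A rough Python sketch of that flow might look like the following. Everything here is a hypothetical model for illustration: the `UserArchive` and `Endpoint` classes, the 4-byte blocks and the SHA-256 hashes are assumptions, and the per-archive key is only generated to show that each archive has its own; no real encryption is performed.

```python
import hashlib
import secrets


class UserArchive:
    """Hypothetical per-user archive on the server, with its own key."""

    def __init__(self):
        # Assumption: one random key per archive. It is not used to encrypt
        # anything in this sketch; it only models "unique key per data set".
        self.key = secrets.token_bytes(32)
        self.blocks = {}  # block hash -> block bytes


class Endpoint:
    """Hypothetical endpoint that deduplicates before uploading."""

    def __init__(self, archive, block_size=4):
        self.archive = archive
        self.block_size = block_size
        self.known = set()   # hashes this endpoint has already uploaded
        self.bytes_sent = 0

    def backup(self, data):
        """Upload only blocks not seen before; return the list of block hashes."""
        recipe = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.known:
                self.archive.blocks[digest] = block  # "upload" the new block
                self.known.add(digest)
                self.bytes_sent += len(block)
            recipe.append(digest)
        return recipe


alice = Endpoint(UserArchive())
bob = Endpoint(UserArchive())

alice.backup(b"AAAABBBBAAAA")  # the repeated AAAA block is sent once
alice.backup(b"AAAACCCC")      # only CCCC is new for Alice
bob.backup(b"AAAABBBB")        # Bob's archive is separate, so both blocks are sent

print("alice sent:", alice.bytes_sent, "bytes")
print("bob sent:  ", bob.bytes_sent, "bytes")
```

Note that Bob's upload is not reduced by anything Alice has already stored: in this model, deduplication stays within each user's own archive.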

Global deduplication sends all of the data on an endpoint to the server. Every block of data is compared to the data index on the server, and new data blocks are indexed and stored. All but one copy of each identical block are removed from the data store, and each duplicate is replaced with a redirect to the single stored copy. Since multiple users must be able to access any particular data block, data is encrypted using a common encryption key across all sets.
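The server-side process described here can be sketched in Python as follows. This is a simplified model under assumptions of my own: the `GlobalStore` class, its method names, the 4-byte fixed blocks and the SHA-256 index are hypothetical, and encryption is left out entirely.

```python
import hashlib


class GlobalStore:
    """Hypothetical server that deduplicates blocks across all users."""

    def __init__(self, block_size=4):
        self.block_size = block_size
        self.index = {}        # block hash -> block bytes, shared by every user
        self.recipes = {}      # (user, file name) -> ordered list of block hashes
        self.bytes_received = 0

    def receive(self, user, name, data):
        """Accept a full upload, then keep only blocks not already indexed."""
        self.bytes_received += len(data)  # all data crosses the network
        recipe = []
        for i in range(0, len(data), self.block_size):
            block = data[i:i + self.block_size]
            digest = hashlib.sha256(block).hexdigest()
            self.index.setdefault(digest, block)  # store new blocks only
            recipe.append(digest)                 # duplicates become references
        self.recipes[(user, name)] = recipe

    def restore(self, user, name):
        """Rebuild a file by following its references into the shared index."""
        return b"".join(self.index[d] for d in self.recipes[(user, name)])

    def bytes_stored(self):
        return sum(len(block) for block in self.index.values())


server = GlobalStore()
server.receive("alice", "a.txt", b"AAAABBBBCCCC")
server.receive("bob", "b.txt", b"AAAABBBBDDDD")

print("received:", server.bytes_received, "bytes")
print("stored:  ", server.bytes_stored(), "bytes")
print("restored:", server.restore("bob", "b.txt"))
```

In this sketch, Bob's file is restored partly from blocks that Alice uploaded first, which is why a shared index implies that every user's data must be readable with the same key.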

Regardless of the deduplication method used, the actual process should happen silently in the background, causing no slow-down or perceived impairment for the end user.

So, should I care about global deduplication?

In short, not as much as some vendors might want you to care. Data deduplication—whether global or local—is largely considered table stakes in the world of enterprise endpoint backup. There are instances where each type may be beneficial—the key is to understand how each type affects your stored data, security requirements and restore times.