Data Deduplication Overview
What is Data Deduplication?
Data Deduplication, often called Dedup for short, is a feature that can help reduce the impact of redundant data on storage costs. When enabled, Data Deduplication optimizes free space on a volume by examining the data on the volume and looking for duplicated portions. Duplicated portions of the volume's dataset are stored once and are (optionally) compressed for additional savings. Data Deduplication optimizes redundancies without compromising data fidelity or integrity. More information about how Data Deduplication works can be found in the 'How does Data Deduplication work?' section of the Understanding Data Deduplication page.
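The idea of storing duplicated portions once, with optional compression, can be illustrated with a small Python sketch. This is a toy model and not how the feature is implemented: the fixed 4 KB chunk size, the SHA-256 hashing, the zlib compression, and the names `dedup_store` and `restore` are all assumptions chosen for illustration.

```python
import hashlib
import zlib


def dedup_store(data: bytes, chunk_size: int = 4096, compress: bool = True):
    """Split data into fixed-size chunks and keep each distinct chunk once."""
    store = {}    # chunk hash -> stored (optionally compressed) chunk bytes
    recipe = []   # ordered chunk hashes needed to rebuild the original data
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:
            # First time this content is seen: store it once.
            store[digest] = zlib.compress(chunk) if compress else chunk
        recipe.append(digest)
    return store, recipe


def restore(store, recipe, compressed: bool = True) -> bytes:
    """Rebuild the original bytes from the chunk store and the recipe."""
    parts = (store[digest] for digest in recipe)
    return b"".join(zlib.decompress(p) if compressed else p for p in parts)


# Hypothetical data: three identical 4 KB blocks followed by one different block.
data = b"A" * 4096 * 3 + b"B" * 4096
store, recipe = dedup_store(data)
assert restore(store, recipe) == data   # the data is recovered byte for byte
print(len(recipe), len(store))          # prints: 4 2
```

Four chunks go in but only two are stored, and the restored bytes are identical to the input, which mirrors the claim that redundancy is removed without affecting data fidelity.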
Important
KB4025334 contains a rollup of fixes for Data Deduplication, including important reliability fixes. We strongly recommend installing it when using Data Deduplication with Windows Server 2016 and Windows Server 2019.
Why is Data Deduplication useful?
Data Deduplication helps storage administrators reduce costs that are associated with duplicated data. Large datasets often have a lot of duplication, which increases the costs of storing the data. For example:
- User file shares may have many copies of the same or similar files.
- Virtualization guests might be almost identical from VM to VM.
- Backup snapshots might have minor differences from day to day.
The space savings that you can gain from Data Deduplication depend on the dataset or workload on the volume. Datasets that have high duplication could see optimization rates of up to 95%, or a 20x reduction in storage utilization. The following table highlights typical deduplication savings for various content types:
Scenario | Content | Typical space savings |
---|---|---|
User documents | Office documents, photos, music, videos, etc. | 30-50% |
Deployment shares | Software binaries, cab files, symbols, etc. | 70-80% |
Virtualization libraries | ISOs, virtual hard disk files, etc. | 80-95% |
General file share | All the above | 50-60% |
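The relationship between a savings percentage and a reduction factor, such as 95% corresponding to 20x, is simple arithmetic. The following Python sketch shows the conversion; the function names `reduction_factor` and `savings_percent` and the sample byte counts are hypothetical and used only for illustration.

```python
def reduction_factor(savings_pct: float) -> float:
    """Convert a space-savings percentage into an 'Nx' reduction factor."""
    if not 0 <= savings_pct < 100:
        raise ValueError("savings_pct must be in the range [0, 100)")
    return 100.0 / (100.0 - savings_pct)


def savings_percent(logical_bytes: int, physical_bytes: int) -> float:
    """Percentage of space saved, given data size before and after optimization."""
    if logical_bytes <= 0:
        raise ValueError("logical_bytes must be positive")
    return 100.0 * (logical_bytes - physical_bytes) / logical_bytes


# 95% savings leaves 5% of the data on disk, which is a 20x reduction.
print(reduction_factor(95))

# Hypothetical volume: 1,000 GB of logical data stored in 50 GB after optimization.
print(savings_percent(1000, 50))

# Reduction factors for the lower and upper bounds quoted in the table.
for low, high in [(30, 50), (70, 80), (80, 95), (50, 60)]:
    print(f"{low}-{high}%: {reduction_factor(low):.1f}x to {reduction_factor(high):.1f}x")
```

Because the factor grows quickly as savings approach 100%, a few percentage points of difference at the high end of the table correspond to a much larger change in the reduction factor than the same difference at the low end.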
Note
If you're just looking to free up space on a volume, consider using Azure File Sync with cloud tiering enabled. This allows you to cache your most frequently accessed files locally and tier your least frequently accessed files to the cloud, saving local storage space while maintaining performance. For details, see Planning for an Azure File Sync deployment.
When can Data Deduplication be used?
Scenario | Description |
---|---|
General purpose file servers | General purpose file servers are general use file servers that might contain many types of shares. |
Virtual Desktop Infrastructure (VDI) deployments | VDI servers, such as Remote Desktop Services, provide a lightweight option for organizations to provision desktops to users. Organizations rely on such technology for a variety of reasons. |
Backup targets, such as virtualized backup applications | Backup applications, such as Microsoft Data Protection Manager (DPM), are excellent candidates for Data Deduplication because of the significant duplication between backup snapshots. |
Other workloads | Other workloads may also be excellent candidates for Data Deduplication. |