The effects of column level compression
The Azure SQL Data Warehouse service is built on SQL Server and extends the capabilities of SQL Server for large scale relational data warehouse workloads. One of the areas of interest is how to you optimize for databases with terabytes or petabytes of data. One answer is Clustered Columnar indexes (CCI). A columnar index is a technology for storing, retrieving, and managing data by using a columnar data format – called a columnstore. The idea is to store data in columns vs. row format which allows for the data to be compressed. Since the data is compressed, the impact is smaller storage requirements and faster data movement (you’re moving smaller amounts of data). We won’t dive into the depths of columnar indexing here, there are plenty of guides that cover this in depth (see Columnstore Index Guide, Columnstore Index: What is new in SQL Server 2016?, and CREATE COLUMNSTORE INDEX).
Now why do you care? When using CCI with SQL Data Warehouse, you’ll often be presented with this problem: I loaded n MB/GB/TB of data into the service, ran DBCC PDW_SHOWSPACEUSED, summed the DATA_SPACE column and see that the space used is dramatically smaller than the amount of data you put in.
An Example
- Scenario #1: You have a 750 MB csv file (represents 1 hour of data) that you generated from your business and have loaded the data via PolyBase into SQL Data Warehouse. You check the database via DBCC PDW_SHOWSPACEUSED and see that you have about 1.4 GB of data.
- Scenario #2: You use that same 750 MB csv file with the other 23 from that day (to represent a full day of data about 14GB of data), loaded in SQL Data Warehouse, and the results of DBCC PDW_SHOWSPACEUSED is 850MB.
So what magic is happening to make the data on disk so much smaller? Our friend, CCI!
A Simple Explanation
In scenario #1, there are not enough rows in the data to meet the criteria for compression. SQL Server will optimize and move the data into Clustered Index (CI) format which does not enable compression. The difference in size is due to the table storage, CI storage, and the possibility that all of the data pages are not fully packed.
In scenario #2, there are enough rows in the data to meet the criteria for compression. In this case the data is very compressible and we achieve around 14x compression.
Best Practice
- CCI works best with data sets above 1M rows. This gives optimal row group sizes for compression and query performance.
Monitoring
To monitor CCI data, you can inspect the metadata for the table to determine the open rowgroups and rows (min, max, and average) as an indicator if CCI is the correct indexing strategy. We have a great tutorial on Managing Columnstore Indexes that walks you through how to determine rowgroups, rowgroup rows/counts, and a couple of strategies for tuning your system to get the best possible performance.
Next Steps
Visit the SQL Data Warehouse Overview to learn more about Microsoft’s scale out relational data warehouse.