Share via


JSON data in clustered column store indexes

Clustered column store indexes (CCI) in SQL Server vNext and Azure SQL Database support LOB types like NVARCHAR(MAX), which allows you to store string with any size, including JSON documents with any size. With CCI you can get 3x compression and query speedup compared to regular tables without any application or query rewrites. In this post we will see one experiment that compares row-store and column store formats used to store JSON collections.

Why would you store JSON documents in CCI?

Clustered column store indexes are good choice for analytics and storage  - they provide high compression of data and faster analytic queries. In this post, we will see what benefits you can get from CCI when you store JSON documents.

I will assume that we have one single column table with CCI that will contain JSON documents:

create table deals (``      data nvarchar(max),``      index cci clustered columnstore );

This is equivalent to collections that you might find in classic NoSQL database because they store each JSON document as a single entity and optionally create indexes on these documents. The only difference is CLUSTERED COLUMNSTORE index on this table that provides the following benefits:

  1. Data compression - CCI uses various techniques to analyze your data and choose optimal compression algorithms to compress data.
  2. Batch mode analytic - queries executed on CCI process rows in the batches from 100 to 900 rows, which might be much faster than row-mode execution.

In this experiment I'm using 6.000.000 json documents exported from TPCH database. Rows from TPCH database are formatted as JSON documents using FOR JSON clause and exported into the tables with and without CCI. The format of the JSON documents used in this experiment is described in the paper: TPC-H applied to MongoDB: How a NoSQL database performs, and shown on the following picture:

tpch-json

Ref: TPC-H applied to MongoDB: How a NoSQL database performs

JSON documents are stored in standard table with a single columns and equivalent table with CCI and performance are compared.

Compression

First we can check what is compression ratio that we are getting when we store JSON in collection with CCI. We can execute the following query to get the size of the table:

 exec sp_spaceused 'deals'

Results returned for table with and without CCI are:

  • Table with CCI 6.165.056 KB
  • Table without CCI 23.997.744 KB

Compression ratio in this case is 3.9x. Although CCI is optimized for scalar data compression, you might also get a good compression on JSON data.

JSON analytic

JSON functions that are available in SQL Server 2016 and Azure SQL Database enable you to parse JSON text and get the values from the JSON. You can use these values in any part of SQL query. An example of the query that calculates average value of extended price grouped by marketing segments is shown in the following sample:
select JSON_VALUE(data, '$.order.customer.mktsegment'), avg(CAST(JSON_VALUE(data, '$.extendedprice') as float)) from deals group by JSON_VALUE(data, '$.order.customer.mktsegment')

Instead of joining different tables you can just change the paths in the second parameter of JSON_VALUE function to select different fields from JSON that you want to analyze.

In this experiment we have simple 5 analytic queries that calculate average value of some price column from the JSON grouped by other json values (queries are similar to the query above). The same queries are executed both on row-store table and table with CCI on Azure SQL Db P11 instance, and the results are shown below:

Query Column store(sec) Row-store (sec)
Q1 11 18
Q2 15 33
Q3 17 36
Q4 18 39
Q5 21 51

Depending on the query, you might get 2-3x speedup in analytic on JSON data.

Conclusion

CLUSTERED COLUMNSTORE indexes provide compression and analytic query speed-up. Without any table changes, or query rewrites you can get up top 4x compression and 3x speed-up on your queries.

SQL Server 2016 SP1 and higher versions enables you to create COLUMNSTORE indexes on any edition (even in the free edition), but in this version there is a size constraint of 8KB on JSON documents.  Therefore, you can use COLUMNSTORE indexes on your JSON data and get performance improvements without any additional query rewrites.

Comments

  • Anonymous
    February 10, 2017
    Great Post :-)Thanks for sharing Jovan
  • Anonymous
    February 24, 2017
    This look absolutely awesome. I'm in the process of writing a completely new architecture for a flagship Cloud-product for a client (started some half year ago) and backthen I took the leap and chose for a "hybrid" rbms/nosql approach, ie. SQLServer with storing related data in JSON fields. At first when SQL2016 came out, I was a little afraid about the performance for querying JSON-data due to the lack of custom indexes, but it seems to go the right direction. As a simple contrived example, imagine I have 2 tables "Order" and "Product". And instead of having the classical "OrderLines" linktable, where one would say what amount of product is used in an order, I would have a JSON field "OrderLines" on my Order-table. Orderlines json contains than rows that have a productId and an amount (simplified example ofcourse)Then how would I construct an efficient query to say, "fetch all orders that have amount > 5 for product x" and what would be the performance difference if I just created a new table "Orderlines" and do it the "classic" way ?The advantages I see in doing it the json-way, is that when you fetch the order, you immediately also have the orderlines. Adding/removing orderlines and then saving the order also becomes easier (and more performant ? due to only 1 update statement). Correct ?The drawback is that the "productId" in the jsonfield no longer has referential integrety but in some case, that can be considered a good thing.
  • Anonymous
    August 22, 2018
    Clustered columnstore doesn't work for NVARCHAR(MAX) columns