How to use GZIP compression while inserting multiple rows at the same time

Andreas ss 726 Reputation points
2020-11-10T00:18:20.907+00:00

Hello!

  • I use: SQL Server Enterprise Core edition

I am trying to understand how to compress a SQL Server table. I have looked at the GZIP algorithm, which would be interesting to test as it offers very good compression. This is my first time working with compression, so I am not exactly sure how to do this correctly. The link below shows that a compressed table can take up only 6% of the original space, which is very good compression.
https://www.mssqltips.com/sqlservertip/5709/using-compress-and-decompress-in-sql-server-to-save-disk-space/
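
For reference, here is a minimal sketch of the COMPRESS/DECOMPRESS pair that tip is built around (the value and variable names here are just illustrative):

DECLARE @original NVARCHAR(MAX) = N'Some long text that is worth compressing...';
-- COMPRESS gzips the value into a VARBINARY(MAX)
DECLARE @compressed VARBINARY(MAX) = COMPRESS(@original);
-- DECOMPRESS reverses it; the result must be cast back to the original type
SELECT CAST(DECOMPRESS(@compressed) AS NVARCHAR(MAX)) AS roundtripped;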

I will describe what I do now, step by step, without compression.

Step 1: I create this table

CREATE TABLE [dbo].[table123] (
    [_DateTime]    SMALLDATETIME DEFAULT (getdate()) NOT NULL,
    [_DayNr]       TINYINT       DEFAULT ((0)) NOT NULL,
    [_CategoryNbr] TINYINT       DEFAULT ((0)) NOT NULL,
    [_FeatureNbr]  SMALLINT      DEFAULT ((-1)) NOT NULL,
    [_Value]       FLOAT (53)    NULL,
    [_Bool]        BIT           NULL,
    CONSTRAINT [PK_table123] PRIMARY KEY CLUSTERED ([_DayNr] ASC, [_DateTime] ASC, [_CategoryNbr] ASC, [_FeatureNbr] ASC),
    CONSTRAINT [UC_table123] UNIQUE NONCLUSTERED ([_FeatureNbr] ASC, [_DateTime] ASC)
);

Step 2: I now insert many rows at a time, usually about 100 rows at a time, to save overhead and make the inserts faster and more efficient.

INSERT INTO table123 (
    _DateTime,
    _DayNr,
    _CategoryNbr,
    _FeatureNbr,
    _Value,
    _Bool
)
VALUES
    ('2010-08-01 17:00:00', 1, 1, 1, 12.4578941564, 1),
    ('2010-08-02 17:00:00', 2, 1, 1, 13.4578941564, 1),
    ('2010-08-03 17:00:00', 3, 1, 1, 14.4578941564, 1),
    ('2010-08-04 17:00:00', 4, 1, 1, 15.4578941564, 1);
    

Many questions at the same time now:
1. Am I thinking correctly about saving overhead/time by inserting multiple rows at a time like this?
2. My big question: how can I apply the GZIP compression algorithm to all those 4 rows (with all column values) while I insert them into the table like above? At the same time, I wonder whether the GZIP algorithm offers the best compression, or one of the best, purely in terms of taking up as little disk storage as possible. If not, please feel free to suggest a better compression method for this table.
3. I believe it must be possible to query/extract the values without decompressing the data first, if such a thing is possible (I am thinking that perhaps not all compression methods can do that).

Thank you!


Accepted answer
  1. tibor_karaszi@hotmail.com 4,316 Reputation points
    2020-11-11T09:07:27.003+00:00

    A tip is that queries with high selectivity (that return relatively few rows) are likely not to perform as well with a columnstore index as with a suitable "traditional" index (row index).

    There are no seeks in a columnstore index. If you are lucky, you get rowgroup elimination.

    The pointer from a row index points to the rowgroup where the row lives in the clustered columnstore index. Remember that the clustered columnstore index is the data. So after finding each row in the row index, SQL Server has to scan about 1,000,000 rows in the columnstore index to find the one you were looking for. For each row.

    Unless the row index covers the query, that is (all the columns that the query refers to are in the row index).
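
    As an illustrative sketch (the index name and column choice are just assumptions based on the table in the question), a covering row index could look like this:

    -- Queries on _FeatureNbr/_DateTime that only need _Value can seek here
    -- and never touch the ~1,000,000-row rowgroups in the columnstore
    CREATE NONCLUSTERED INDEX ix_feature_cover
        ON dbo.table123 (_FeatureNbr, _DateTime)
        INCLUDE (_Value);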

    There is much, much more to columnstore indexes than first meets the eye. Make sure you understand the architecture etc. for columnstore indexes before deciding on them.

    1 person found this answer helpful.

4 additional answers

  1. Erland Sommarskog 121.4K Reputation points MVP Volunteer Moderator
    2020-11-10T23:11:07.64+00:00

    Here is your table with a clustered columnstore index:

    CREATE TABLE [dbo].[table123] (
    [_DateTime] SMALLDATETIME DEFAULT (getdate()) NOT NULL,
    [_DayNr] TINYINT DEFAULT ((0)) NOT NULL,
    [_CategoryNbr] TINYINT DEFAULT ((0)) NOT NULL,
    [_FeatureNbr] SMALLINT DEFAULT ((-1)) NOT NULL,
    [_Value] FLOAT (53) NULL,
    [_Bool] BIT NULL,
    CONSTRAINT [PK_table123] PRIMARY KEY NONCLUSTERED ([_DayNr] ASC, [_DateTime] ASC, [_CategoryNbr] ASC, [_FeatureNbr] ASC),
    CONSTRAINT [UC_table123] UNIQUE NONCLUSTERED ([_FeatureNbr] ASC, [_DateTime] ASC),
    INDEX colstore_ix CLUSTERED COLUMNSTORE
    );

    I don't have the syntax for COLUMNSTORE_ARCHIVE around, but as Tibor says, you should first get acquainted with columnstore to determine whether it is for you.
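
    For reference, a sketch of that syntax based on the documented DATA_COMPRESSION option (using the index name from the table above; verify against your SQL Server version):

    ALTER INDEX colstore_ix ON dbo.table123
        REBUILD WITH (DATA_COMPRESSION = COLUMNSTORE_ARCHIVE);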

    1 person found this answer helpful.

  2. tibor_karaszi@hotmail.com 4,316 Reputation points
    2020-11-10T07:54:22.27+00:00

    GZIP compression is suitable when you have one column which can hold lots of data, typically an nvarchar(max). In your case you only have tiny columns, so this kind of compression (which works at the column value level) isn't suitable for you.

    Possibly Data Compression is a better choice: https://learn.microsoft.com/en-us/sql/relational-databases/data-compression/data-compression?view=sql-server-ver15
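
    A minimal sketch of what that could look like on your table (PAGE is just one example choice; ROW is the lighter alternative):

    -- Rebuild the table, applying page compression to its clustered index
    ALTER TABLE dbo.table123 REBUILD WITH (DATA_COMPRESSION = PAGE);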

    Or, perhaps even having a clustered columnstore index on the table, but that is pretty much aimed at data analysis rather than OLTP types of work.


  3. CathyJi-MSFT 22,396 Reputation points Microsoft External Staff
    2020-11-10T09:47:54.603+00:00

    Hi @Andreas ss ,

    The COMPRESS function compresses data using the GZIP algorithm. It is most suitable for compressing portions of the data when archiving old data for long-term storage. Data compressed using the COMPRESS function cannot be indexed. Please refer to COMPRESS (Transact-SQL).

    As TiborKaraszi mentioned, you can try to use Data Compression, please refer to the blog How to use SQL Server Data Compression to Save Space.

    Best regards,
    Cathy



  4. tibor_karaszi@hotmail.com 4,316 Reputation points
    2020-11-10T15:53:20.76+00:00

    To put it bluntly, you shouldn't use columnstore unless you are pretty familiar with columnstore. I'm talking about the architecture behind columnstore, the advantages and the disadvantages. It is mainly designed for data warehousing, and I wouldn't recommend using it unless you are pretty confident in what you are doing.

    1: Instead of creating a regular clustered index, you create a clustered columnstore index.

    2: The code you posted doesn't do columnstore_archive compression, it does data compression (page). Anyhow, you can specify what compression you want in the CREATE (table/index) statement and the data will be compressed from the start. Note, however, that page compression only kicks in when a page is full and is targeted for more rows. And for columnstore, SQL Server batches rows in groups of 1,000,000. I.e., you do 1,000,000 inserts, which under the covers are stored uncompressed, and then after 1,000,000 rows they are compressed into a compressed rowgroup.
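
    To check whether row or page compression pays off before committing, there is a documented estimation procedure; here is a sketch against the table from the question:

    -- Estimate the space saving of PAGE compression on dbo.table123
    EXEC sp_estimate_data_compression_savings
        @schema_name = 'dbo',
        @object_name = 'table123',
        @index_id = NULL,
        @partition_number = NULL,
        @data_compression = 'PAGE';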

    There are lots and lots to say about columnstore. I suggest you prepare to spend some time reading up on it. Two starting points are:
    http://www.sqlservercentral.com/stairway/121631/
    http://www.nikoport.com/columnstore/

