How could I compress this table in the most optimal way

Andreas ss 726 Reputation points
2020-11-06T22:35:12.44+00:00

Hello!

I have a question about how to compress a table in a database. I have read a lot about it, but it seems to depend a lot on the structure of the database, what types of columns exist in the tables, etc. So I am a bit confused about what would be an optimal way to save as much space as possible.

My main concern is space, so I will focus on making the table take up as little storage as possible. The database consists 99.99% of the exact table below. I will have thousands of copies of this table, and the database could take up as much as 28 TB in the end.

My processing and table look and work as described below, if that helps in deciding what type of compression technique could be a good idea to use (row, page, etc.).

  1. New data will be added to the table, but data will never be deleted. Approximately 100 rows will be added each day.
  2. The operation I will definitely do the most is querying this table, countless times. Let's say millions and millions of queries, over and over again, if that gives any decent clue.
  3. My computer has 64 processor cores (I know some compression is more CPU intensive, so that should be okay, since I have many cores and the Enterprise Edition of SQL Server, which can use all cores).

I am not sure when and how I should compress a table. Do I first build the table completely and then compress it at the end? How does this work exactly?
And once it is compressed, must I read the table with queries in a different way than usual?

I don't know how this is done, so I have 2-3 questions at the same time and don't know where to begin.

CREATE TABLE [dbo].[table123] (
    [_DateTime]    SMALLDATETIME DEFAULT (getdate()) NOT NULL,
    [_DayNr]       TINYINT       DEFAULT ((0))       NOT NULL,
    [_CategoryNbr] TINYINT       DEFAULT ((0))       NOT NULL,
    [_FeatureNbr]  SMALLINT      DEFAULT ((-1))      NOT NULL,
    [_Value]       FLOAT (53)                        NULL,
    [_Bool]        BIT                               NULL,
    CONSTRAINT [PK_table123] PRIMARY KEY CLUSTERED ([_DayNr] ASC, [_DateTime] ASC, [_CategoryNbr] ASC, [_FeatureNbr] ASC),
    CONSTRAINT [UC_table123] UNIQUE NONCLUSTERED ([_FeatureNbr] ASC, [_DateTime] ASC)
);

Thank you!

Accepted answer
  Erland Sommarskog 110.3K Reputation points
    2020-11-08T10:56:50.037+00:00

    The tables contain stock symbol data, where each row has data for a specific date and time. Each table holds one symbol for one month. For example, for symbol Microsoft (MSFT), year 2010 and month January (1), the table is named:

    MSFT20101

    This saves two columns of data in the table (Year and Month) and makes queries faster, because I know directly which Year-and-Month table I need to extract information from.

    You will save some space compared to a single table where you store the data in a traditional B-tree format, since the symbol and month are not repeated. But in execution time? Not much, since an index permits you to efficiently access the rows for that month.
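
    As a rough sketch of what that single-table alternative could look like (the table and column names here are my own illustration, not something from the thread), the symbol and month live in the clustered key instead of in the table name, and a date-range predicate gives the same targeted access:

        -- Hypothetical combined rowstore table: one table for all symbols and months.
        CREATE TABLE dbo.StockData (
            SymbolId       SMALLINT      NOT NULL,  -- id from a small lookup table of symbols
            [_DateTime]    SMALLDATETIME NOT NULL,
            [_DayNr]       TINYINT       NOT NULL,
            [_CategoryNbr] TINYINT       NOT NULL,
            [_FeatureNbr]  SMALLINT      NOT NULL,
            [_Value]       FLOAT (53)    NULL,
            [_Bool]        BIT           NULL,
            CONSTRAINT PK_StockData PRIMARY KEY CLUSTERED
                (SymbolId, [_DateTime], [_CategoryNbr], [_FeatureNbr])
        );

        -- "MSFT, January 2010" becomes an index seek on the clustered key,
        -- so no other symbols or months are touched:
        SELECT [_DateTime], [_FeatureNbr], [_Value]
        FROM   dbo.StockData
        WHERE  SymbolId    = 7                  -- assumed id for MSFT
          AND  [_DateTime] >= '2010-01-01'
          AND  [_DateTime] <  '2010-02-01';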

    If you store the data in columnstore format, not much space is spent on the month and symbol, because of how that format works. But access may be slower.

    As for questions "how much extra time" etc, I only have one answer: you need to benchmark. There are no simple answers here.

    The compression models I listed permit you to query the data while it is stored in compressed format. If all you are going to do is extract the values to a text file, I don't see any need to decompress at all.
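
    To make that concrete: the query text does not change once compression is enabled; the engine decompresses pages (or columnstore segments) transparently as it reads them. A query against the posted table looks the same either way:

        -- Works identically whether dbo.table123 is uncompressed, row-compressed,
        -- page-compressed, or stored as a columnstore:
        SELECT [_DateTime], [_Value]
        FROM   dbo.table123
        WHERE  [_DayNr] = 5
          AND  [_CategoryNbr] = 2;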

    But there are of course more extreme versions. Rather than storing the information row by row, take all the stock data for MSFT January 2010, stick it into your own string, and store that string compressed. Or in your own binary format. Maybe even applying your own compression...
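
    One built-in way to do that kind of blob storage (a sketch with made-up table and column names; COMPRESS and DECOMPRESS are standard functions in SQL Server 2016 and later and use GZip) is to serialize one symbol-month yourself and store it as a single compressed value:

        -- Hypothetical blob table: one row per symbol and month.
        CREATE TABLE dbo.StockBlob (
            Symbol  CHAR(10)       NOT NULL,
            YearNr  SMALLINT       NOT NULL,
            MonthNr TINYINT        NOT NULL,
            Payload VARBINARY(MAX) NOT NULL,   -- GZip-compressed serialized rows
            CONSTRAINT PK_StockBlob PRIMARY KEY (Symbol, YearNr, MonthNr)
        );

        -- Store: @MonthData is the month's rows serialized to text by the application.
        DECLARE @MonthData NVARCHAR(MAX) = N'...';
        INSERT dbo.StockBlob (Symbol, YearNr, MonthNr, Payload)
        VALUES ('MSFT', 2010, 1, COMPRESS(@MonthData));

        -- Read back: DECOMPRESS returns VARBINARY(MAX), so cast to the original type.
        SELECT CAST(DECOMPRESS(Payload) AS NVARCHAR(MAX)) AS MonthData
        FROM   dbo.StockBlob
        WHERE  Symbol = 'MSFT' AND YearNr = 2010 AND MonthNr = 1;

    The obvious trade-off is that you can no longer filter or aggregate inside the blob with T-SQL; you have to decompress and parse it on the client.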

    You could spend just as much time on finding the best compression model as you spend on your random forest. If not more.


1 additional answer

  Erland Sommarskog 110.3K Reputation points
    2020-11-07T19:34:56.593+00:00

    Why thousands of tables? Why not a single table, assuming all tables have the same schema? If nothing else, that will make compression more efficient.

    Your question is impossible to answer, because there are so many ifs and buts. And tradeoffs. But these are your options:

    row-compression - this is basically only an alternate storage format and not much of a compression, but it can save some space with some data types. With the schema you have, I don't think there will be any difference at all.

    page-compression - this gives better compression, but there is also a cost for reading and writing data.

    columnstore - this is a different storage format where data is stored columnwise instead of rowwise. This permits high compression rates, better than page compression. For this format to make sense for you, you would have a single table, not thousands. Columnstore is intended for data warehousing - you would not use columnstore for an OLTP system. Whether your simulation workload qualifies, I don't know. If you need to do a lot of point lookups, columnstore will not work well for you, I think. But if you expect to scan a lot, it is a winner, since the compression means that there is less to scan.

    columnstore_archive - this is a variation of columnstore that gives the best compression, but I guess it is slower to read.
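
    A minimal sketch of how these options are applied to the table in the question, assuming SQL Server 2016 or later (the index name CCI_table123 is my own; everything else comes from the posted schema). You build and load the table as usual and apply the compression afterwards with a rebuild:

        -- Row or page compression: rebuild the existing rowstore table in place.
        ALTER TABLE dbo.table123
            REBUILD WITH (DATA_COMPRESSION = PAGE);   -- or ROW

        -- You can estimate the saving before deciding:
        EXEC sp_estimate_data_compression_savings
             @schema_name = N'dbo', @object_name = N'table123',
             @index_id = NULL, @partition_number = NULL,
             @data_compression = N'PAGE';

        -- Columnstore: the clustered index becomes a clustered columnstore index.
        -- Because the clustered index here is the primary key, the PK is recreated
        -- as nonclustered (allowed alongside a columnstore in SQL Server 2016+).
        ALTER TABLE dbo.table123 DROP CONSTRAINT PK_table123;

        CREATE CLUSTERED COLUMNSTORE INDEX CCI_table123
            ON dbo.table123
            WITH (DATA_COMPRESSION = COLUMNSTORE);    -- or COLUMNSTORE_ARCHIVE

        ALTER TABLE dbo.table123 ADD CONSTRAINT PK_table123
            PRIMARY KEY NONCLUSTERED ([_DayNr], [_DateTime], [_CategoryNbr], [_FeatureNbr]);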

