DWH Modeling - Fact table PK - VARCHAR(23) vs DECIMAL(23,0)

Pedro Soares 396 Reputation points
2022-05-01T10:11:50.487+00:00

Hi guys,

I'm modeling a fact table that has a document number as a degenerate dimension. In the source table this number is a varchar(23), but it is guaranteed to be a string of digits 0 to 9 in every position. Since it can logically serve as the primary key of the fact table, I checked whether the SQL Server integer types cover it, but they do not: the largest, BIGINT, only goes up to 19 digits.

I then noticed I could use a decimal(23,0). This would save some space: the varchar(23) takes approximately 25 bytes (23 characters at 1 byte each, plus 2 bytes of length information).
The decimal(23,0) stores values of precision 20-28 in 13 bytes, roughly half of that. The customer also told me this document number is changing to 15 characters, but the historical numbers will remain 23. This means that even at 15 characters, the same decimal(23,0), since it has a fixed precision, will still take 13 bytes vs 17 bytes for a varchar. That's still a gain of 4 bytes per row, which on a 150 million row table will save >550MB from day 1, and counting.
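The size difference is easy to verify directly. A minimal sketch (assuming SQL Server; DATALENGTH reports the bytes used by the value itself, excluding varchar's additional 2-byte length prefix):

```sql
-- Compare stored sizes of the two candidate types.
DECLARE @doc_v varchar(23)   = '12345678901234567890123';
DECLARE @doc_d decimal(23,0) = 12345678901234567890123;

SELECT DATALENGTH(@doc_v) AS varchar_bytes,   -- 23, plus 2 bytes of length metadata
       DATALENGTH(@doc_d) AS decimal_bytes;   -- 13 (fixed for precision 20-28)
```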

In the fact table, these records will be updated (existing records) / inserted (new) via SQL MERGE on this PK, so I'm thinking it's worth changing it to a decimal(23,0) instead of the varchar(23), although I would like to hear some opinions and brainstorm a little.

Thank you

Credit to where it's due: https://www.sqlshack.com/understanding-sql-decimal-data-type/


Accepted answer
  1. Erland Sommarskog 78,346 Reputation points MVP
    2022-05-01T10:23:03.263+00:00

    First question: can there be leading zeroes in those document numbers? If so, decimal(23, 0) is out of the game and the only question is whether you should have char(23) or varchar(23). char(23) would be the obvious choice if they were to stick with these 23-digit numbers, but with the change to 15 digits, it seems that varchar is the better choice.
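    A quick way to test the leading-zero question against the source (a sketch, using the staging table and column names that appear in the MERGE statement later in this thread):

    ```sql
    -- Count source rows whose document number starts with '0';
    -- any hit rules out a lossless conversion to decimal(23,0).
    SELECT COUNT(*) AS leading_zero_rows
    FROM [STG].[T_CHARGE_DOCUMENT]
    WHERE DOCUMENT_NUMBER LIKE '0%';
    ```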

    Even if there can be no leading zeroes, my gut feeling says character, because to me this feels like a string. But, yes, you will save some space with decimal.

    If you go for varchar, you can consider using a binary collation to speed up operations on this column.
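    For example (a sketch using the fact table name from this thread; BIN2 collations are generally preferred over the older BIN ones for new work):

    ```sql
    -- Declare the degenerate dimension with a binary collation so that
    -- joins and equality comparisons skip linguistic comparison rules.
    CREATE TABLE [DWH].[FCT_CHARGE_DOCUMENT]
    (
        DOCUMENT_NUMBER varchar(23) COLLATE Latin1_General_BIN2 NOT NULL
            PRIMARY KEY
        -- measure columns would follow here
    );
    ```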

    0 comments No comments

2 additional answers

  1. Pedro Soares 396 Reputation points
    2022-05-01T10:40:50.893+00:00

    Hi @Erland Sommarskog ,

    That was interesting. I found 148 old historical records from 2004 with a leading 0. So I think you are right, and I even think I should shield against this right now, before this number logic turns into something different in the future, like backslashes, pipes, or some other character scheme.
    As for the binary collation, I'm stuck with a linguistic collation for now, Latin1_General_CI_AS, due to customer organisation policies.


  2. Pedro Soares 396 Reputation points
    2022-05-01T11:37:18.787+00:00

    In my case, it will only be used on the merge during the daily batch ETL. It is however the PK of the fact table.

    MERGE [DWH].[FCT_CHARGE_DOCUMENT] AS TARGET
    USING [STG].[T_CHARGE_DOCUMENT] AS SOURCE
    ON SOURCE.DOCUMENT_NUMBER = TARGET.DOCUMENT_NUMBER

    (...)

    Analytically, this is always aggregated per time period.
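    For context, the elided part of such a MERGE typically follows the standard upsert pattern; a sketch, where every column other than DOCUMENT_NUMBER is hypothetical and stands in for the real measure columns:

    ```sql
    MERGE [DWH].[FCT_CHARGE_DOCUMENT] AS TARGET
    USING [STG].[T_CHARGE_DOCUMENT] AS SOURCE
        ON SOURCE.DOCUMENT_NUMBER = TARGET.DOCUMENT_NUMBER
    WHEN MATCHED THEN
        -- CHARGE_AMOUNT is a placeholder measure column
        UPDATE SET TARGET.CHARGE_AMOUNT = SOURCE.CHARGE_AMOUNT
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (DOCUMENT_NUMBER, CHARGE_AMOUNT)
        VALUES (SOURCE.DOCUMENT_NUMBER, SOURCE.CHARGE_AMOUNT);
    ```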
