Why no hash surrogate primary key for hubs in Synapse Analytics?

kosmos 246 Reputation points
2022-10-23T14:08:59.127+00:00

According to the Microsoft White Paper (page 13), Synapse Analytics as a data warehouse solution underperforms when using hash surrogate primary keys or surrogate keys for hubs and links. Instead it is recommended to use the natural keys as distribution keys.

Literally from the whitepaper:
"It is a common (but not mandatory) practice for Data Vault to use a hash surrogate primary key for hubs and links... This isn’t the case for Azure Synapse Analytics. It performs better if you avoid hashing and using a surrogate key: instead you should use the natural key..."

However, the paper does not explain why.

Linstedt, in his book on Data Vault, justifies the use of hash keys.

I would like to understand better why using hash keys hurts performance in Azure Synapse Analytics.

Azure Synapse Analytics

Accepted answer
  1. Wang, Philippe 76 Reputation points
    2022-10-26T14:52:03.783+00:00

    Hello,

    To my understanding, page 13 of the white paper refers to a distribution strategy. You would typically use a hash function for distribution, as it spreads the data deterministically across buckets (distributions) for parallel processing across nodes.

    https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute

    The paper, which I have to agree is a bit confusing on this point, probably advises hash-distributing on a natural key instead of on an artificial surrogate key (auto-increment identity). The reason would be, as stated in the Synapse documentation linked above:

    "Since identical values always hash to the same distribution, SQL Analytics has built-in knowledge of the row locations. In dedicated SQL pool this knowledge is used to minimize data movement during queries, which improves query performance."

    So instead of hashing an auto-increment surrogate key, which acts just like a row number, it is better to hash a natural key, especially if your table has a lot of duplicate data.

    However, if your data is clean, I doubt it makes any difference. The key point is to use a column with many unique values for distribution, and which is not a date.

    This doesn't affect the use of surrogate keys in the Data Vault, which can be auto-increment surrogate keys, MD5 hash surrogate keys (not the same as the distribution hash referred to in the paper) in the new Data Vault 2.0 approach, or natural keys. All three methods are possible.

    Best,
    Philippe
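
The co-location argument in the answer above can be illustrated with a small Python simulation. This is only a sketch under stated assumptions: `distribution_for` is a toy stand-in built on SHA-256 (the engine's real distribution hash is internal and not reproduced here), and the `row_id` / `customer_no` columns and sample rows are hypothetical. It counts how many distributions hold the rows of each natural key when the table is hash-distributed on the natural key versus on a row-number-like surrogate.

```python
import hashlib
from collections import defaultdict

# A dedicated SQL pool spreads each table over 60 distributions.
NUM_DISTRIBUTIONS = 60


def distribution_for(value) -> int:
    """Toy deterministic hash-to-bucket mapping (assumption: NOT Synapse's real hash)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_DISTRIBUTIONS


def distributions_per_customer(rows, distribution_column):
    """Map each natural key to the set of distributions that hold its rows."""
    spread = defaultdict(set)
    for row in rows:
        bucket = distribution_for(row[distribution_column])
        spread[row["customer_no"]].add(bucket)
    return dict(spread)


# Hypothetical rows: row_id is an auto-increment surrogate,
# customer_no is the natural (business) key; each customer has 10 rows.
rows = [
    {"row_id": i, "customer_no": f"C-{1000 + i % 3}"}
    for i in range(1, 31)
]

by_natural_key = distributions_per_customer(rows, "customer_no")
by_surrogate = distributions_per_customer(rows, "row_id")

# Hashing the natural key: identical values always land in the same distribution.
print({key: len(buckets) for key, buckets in by_natural_key.items()})

# Hashing the surrogate: one customer's rows are scattered over several distributions.
print({key: len(buckets) for key, buckets in by_surrogate.items()})
```

In this toy model, a join or aggregation on `customer_no` can be resolved inside each distribution only in the first case. In the second case, rows for the same customer would first have to be moved between distributions, which is the data movement the quoted documentation refers to.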

