Why hash partitioning on a table in synapse dedicated pool (SQL DW) is skewing data?

Acharya, Tapan 0 Reputation points
2023-10-02T16:25:40.98+00:00

Hi there,

I am creating a synapse table having around 850 million (Data size around 50GB)records. I am doing hash distribution on this table base on a key (example order No). We have data in this table from difference source system. When I only do a hash distribution on this table (No partitioning at all) the data is not skewed at all and performance of my report which are on top of this table are pretty good.

But when I have further partition this table base on source system along with hash distribution the data is skewed and query performance from reports are really bad.

I am wondering why SQL DW is behaving this way. My thought was if I do partitioning on source system then queries should perfom better as this is used in some of the filter criterias.

Thanks,

Tapan.

Azure Synapse Analytics
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
4,473 questions
{count} votes

1 answer

Sort by: Most helpful
  1. Amira Bedhiafi 16,231 Reputation points
    2023-10-03T13:14:01.2333333+00:00

    Hash distribution in Azure Synapse Analytics is designed to evenly distribute data across all the distributions based on a hash key. When you have a single hash distribution, SQL DW will hash the values of your key (for example order no) and distribute the rows as uniformly as possible across all distributions, which minimizes data movement during query execution and optimizes performance.

    You can check this link : https://techcommunity.microsoft.com/t5/azure-synapse-analytics-blog/multi-column-distribution-for-dedicated-sql-pools-is-now-ga/ba-p/3774529#:~:text=What%20is%20a%20hash%20distributed%20table%3F%20Dedicated%20SQL,data%20improving%20query%20performance%20on%20large%20fact%20tables.

    However, when you introduce an additional layer of partitioning based on the source system, the partitions associated with those systems will be larger. Combined with the hash distribution, this can lead to uneven data distribution.

    Another detail, if the distribution of order numbers isn't uniform across source systems, you might end up with certain distributions having more data for specific source systems and this can introduce skew.

    Imagine the scenario then, the engine has to consider both the hash distribution and the partitioning. If your query doesn't specifically benefit from the partitioning scheme then the partitioning may introduce unnecessary overhead.

    1 person found this answer helpful.
    0 comments No comments