Why is hash partitioning on a table in a Synapse dedicated pool (SQL DW) skewing data?

Question

Acharya, Tapan 0

Hi there,

I am creating a Synapse table with around 850 million records (data size around 50 GB). I am hash-distributing this table on a key (for example, order No). The table holds data from different source systems. When I only hash-distribute the table (no partitioning at all), the data is not skewed at all and the reports built on top of this table perform pretty well.

But when I further partition this table by source system, in addition to the hash distribution, the data is skewed and query performance from the reports is really bad.

I am wondering why SQL DW behaves this way. My thought was that partitioning on source system should make queries perform better, since it is used in some of the filter criteria.

Thanks,

Tapan.

Harishga 6,005 Reputation points Microsoft External Staff

2023-10-04T14:05:53.15+00:00
Hi @Acharya, Tapan,
Welcome to Microsoft Q&A platform and thanks for posting your question here.
When you use hash partitioning on a table in a Synapse dedicated pool (SQL DW), the data can become skewed if the partitioning key is not uniformly distributed across the source systems. This can lead to uneven data distribution and poor query performance. When you introduce an additional layer of partitioning based on the source system, the partitions for the source systems that contribute more rows will be larger. Combined with the hash distribution, this can lead to uneven data distribution and processing skew.

To avoid data skew, you can follow some best practices for designing distributed tables in Synapse dedicated pool. These include:

Choose the right distribution strategy based on the data and query patterns. Hash distribution is suitable for large fact tables, while round-robin distribution is suitable for small tables or tables with no clear distribution key.

Select a distribution column or set of columns that lead to an even spread of data across each distribution and thus to minimum data skew.

Use multi-column distribution to distribute data based on multiple columns.

Use partitioning to further divide the data into smaller chunks based on a partitioning key. However, avoid excessive partitioning, as it can reduce the effectiveness of clustered columnstore indexes.

Monitor the impact of skew on query performance and resolve data skew as needed by re-creating the table with a different distribution column(s).

By following these best practices, you can design distributed tables that minimize data skew and improve query performance in Synapse dedicated pool.
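To make the warning about excessive partitioning concrete, here is a rough back-of-the-envelope sketch. It relies on two documented facts: a dedicated SQL pool spreads every table over 60 distributions, and clustered columnstore tables are recommended to have roughly 1 million rows per distribution and partition. The source system names and their percentage shares are purely hypothetical, and the sketch assumes the hash key spreads rows perfectly evenly.

```python
DISTRIBUTIONS = 60            # a dedicated SQL pool spreads each table over 60 distributions
ROWGROUP_TARGET = 1_000_000   # guideline: about 1M rows per distribution and partition

def rows_per_cell(total_rows, partition_pct):
    """Approximate rows per (distribution, partition), assuming perfectly even hashing.

    partition_pct maps a partition name to its integer percentage of all rows.
    """
    return {
        name: total_rows * pct // (100 * DISTRIBUTIONS)
        for name, pct in partition_pct.items()
    }

# Hypothetical split of the 850 million rows across four source systems.
shares = {"SRC_A": 70, "SRC_B": 25, "SRC_C": 4, "SRC_D": 1}
cells = rows_per_cell(850_000_000, shares)

for name, rows in cells.items():
    status = "ok" if rows >= ROWGROUP_TARGET else "below guideline"
    print(f"{name}: ~{rows:,} rows per distribution ({status})")
```

Under these assumed shares, the two smallest source systems end up with well under 1 million rows in each distribution, which is one plausible way partitioning can hurt columnstore performance even when the hash key itself is evenly spread. Without partitioning, the same table would have roughly 14 million rows per distribution.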
Reference links:

https://learn.microsoft.com/en-us/azure/synapse-analytics/sql-data-warehouse/sql-data-warehouse-tables-distribute

https://adatis.co.uk/skewness-in-hash-distributed-tables/

https://azureaggregator.wordpress.com/2023/02/17/azure-synapse-analytics-dedicated-sql-pool-data-modelling-best-practices/

I hope this information helps you. Let me know if you have any further questions or concerns.
Harishga 6,005 Reputation points Microsoft External Staff

2023-10-05T16:01:09.7966667+00:00

Hi @Acharya, Tapan,
We haven’t heard back from you on the last response and were just checking to see whether you have a resolution yet. If you do, please share it with the community, as it can be helpful to others. Otherwise, please respond with more details and we will try to help.
Harishga 6,005 Reputation points Microsoft External Staff

2023-10-06T13:09:41.64+00:00

Hi @Acharya, Tapan,
We haven’t heard back from you on the last response and were just checking to see whether you have a resolution yet. If you do, please share it with the community, as it can be helpful to others. Otherwise, please respond with more details and we will try to help.

Answer 1

Hash distribution in Azure Synapse Analytics is designed to evenly distribute data across all the distributions based on a hash key. When you have a single hash distribution, SQL DW will hash the values of your key (for example order no) and distribute the rows as uniformly as possible across all distributions, which minimizes data movement during query execution and optimizes performance.
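The idea can be illustrated with a small simulation. This is only a sketch: SHA-256 stands in for the hash function, since the one Synapse actually uses is internal, and both sets of order numbers are made up. It maps each key to one of 60 buckets and reports the largest bucket relative to the average, so a value near 1 means an even spread.

```python
import hashlib
from collections import Counter

DISTRIBUTIONS = 60  # a dedicated SQL pool uses 60 distributions

def distribution_of(key):
    """Map a key to a bucket in 0..59. SHA-256 is a stand-in for the real, internal hash."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % DISTRIBUTIONS

def skew_ratio(keys):
    """Largest bucket divided by the average bucket size (1.0 means perfectly even)."""
    counts = Counter(distribution_of(k) for k in keys)
    sizes = [counts.get(d, 0) for d in range(DISTRIBUTIONS)]
    return max(sizes) / (sum(sizes) / DISTRIBUTIONS)

# Hypothetical keys: one order number per row vs. only 20 distinct order numbers.
unique_orders = range(600_000)
repeated_orders = [n % 20 for n in range(600_000)]

print("unique keys:  ", round(skew_ratio(unique_orders), 2))
print("repeated keys:", round(skew_ratio(repeated_orders), 2))
```

With a high-cardinality key the ratio stays close to 1. With only 20 distinct values, at most 20 of the 60 buckets receive any rows, so the ratio is at least 3.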

You can check this link : https://techcommunity.microsoft.com/t5/azure-synapse-analytics-blog/multi-column-distribution-for-dedicated-sql-pools-is-now-ga/ba-p/3774529#:~:text=What%20is%20a%20hash%20distributed%20table%3F%20Dedicated%20SQL,data%20improving%20query%20performance%20on%20large%20fact%20tables.

However, when you introduce an additional layer of partitioning based on the source system, the partitions for the source systems that contribute more rows will be larger. Combined with the hash distribution, this can lead to uneven data distribution.

Another detail: if the distribution of order numbers isn't uniform across source systems, certain distributions may end up holding more data for specific source systems, and this can introduce skew.

In that scenario, the engine has to consider both the hash distribution and the partitioning. If your query doesn't specifically benefit from the partitioning scheme, the partitioning may introduce unnecessary overhead.
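A minimal sketch of that effect, under stated assumptions: the source system names and order number patterns are invented, SHA-256 stands in for the internal Synapse hash, and the distribution is chosen from the order number alone. The code counts rows per distribution separately for each source system and reports the largest count relative to that system's average.

```python
import hashlib
from collections import defaultdict

DISTRIBUTIONS = 60  # a dedicated SQL pool uses 60 distributions

def distribution_of(order_no):
    """Map an order number to a bucket in 0..59 (SHA-256 is a stand-in hash)."""
    digest = hashlib.sha256(str(order_no).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % DISTRIBUTIONS

def per_source_skew(rows):
    """rows: iterable of (source_system, order_no).

    Returns {source_system: largest distribution / average distribution}.
    """
    cells = defaultdict(lambda: [0] * DISTRIBUTIONS)
    for source, order_no in rows:
        cells[source][distribution_of(order_no)] += 1
    return {s: max(c) / (sum(c) / DISTRIBUTIONS) for s, c in cells.items()}

# Hypothetical data: SRC_A has a new order number on every row,
# while SRC_B reuses only 10 order numbers across all of its rows.
rows = [("SRC_A", f"A-{i}") for i in range(300_000)]
rows += [("SRC_B", f"B-{i % 10}") for i in range(30_000)]

skew = per_source_skew(rows)
for source, ratio in skew.items():
    print(f"{source}: largest distribution is {ratio:.2f}x the average")
```

In this made-up data the table as a whole still looks fairly even, because SRC_A dominates it, but the SRC_B partition is concentrated in at most 10 distributions. A query that filters on SRC_B would then do most of its work on a handful of distributions.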
