Reducing Data Scanning Overhead in Delta Format in Azure Synapse Analytics SQL Pool

Question

Reducing Data Scanning Overhead in Delta Format in Azure Synapse Analytics SQL Pool

Vinod Kumar Kapa 0

We've established a serverless SQL pool with a hierarchical folder structure partition and employed open row set to reduce data scanning when executing queries on Parquet file format from the serverless SQL pool.

Now, as we endeavor to utilize the delta format for queries within the serverless SQL pool, we've encountered significant scanning overhead and multiple entries in SQL requests.

Is it expected behavior with having delta format in server less sql pool.

What are the possible solutions to get reduce the data scan.

1 answer

Your answer

Answer 1

Hi @Vinod Kumar Kapa

Welcome to Microsoft Q&A platform and thanks for posting your question here.

According to the Azure documentation, querying Delta Lake format in serverless Synapse SQL pool is currently in public preview. This preview version is provided without a service level agreement, and it's not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. Therefore, it is possible to encounter significant scanning overhead and multiple entries in SQL requests when using the delta format for queries within a serverless SQL pool.

To reduce the data scan, you can follow the best practices for serverless SQL pool provided by Azure. Here are some possible solutions:

The data types you use in your query affect performance and concurrency. Use the smallest data size that can accommodate the largest possible value. If possible, use varchar and char instead of nvarchar and nchar.
Use PARSER_VERSION 2.0 to query Delta Lake files: You can use a performance-optimized parser when you query Delta Lake files.
Creating statistics for columns used in queries can improve query performance in Azure Synapse Analytics. The serverless SQL pool uses statistics to generate optimal query execution plans. While statistics are automatically created for some file types, they are not automatically created for Delta Lake files when using external tables. It's important to manually create statistics for Delta Lake files, especially for columns used in DISTINCT, JOIN, WHERE, ORDER BY, and GROUP BY clauses.
Optimizing the partition strategy in your data lake can improve query performance in Azure Synapse Analytics. To do this, you need to identify the file types in the data lake and the query performance SLA. You should also consider whether your use case demands predictable performance and cost, and whether you have unplanned or bursty SQL analytical workloads. It's important to identify the data consuming pattern and platforms. By doing so, you can ensure that you have a good partition strategy that meets your specific needs and improves query performance.
Using Spark pool assessment can help improve query performance in Azure Synapse Analytics. Spark pools enable data engineering and data preparation, and Apache Spark includes many optimization techniques that can improve query performance. By using Spark pool assessment, you can identify performance bottlenecks and optimize your Spark jobs to improve query performance.

Reference
https://learn.microsoft.com/en-us/azure/synapse-analytics/guidance/implementation-success-assess-environment

https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-openrowset#parser_version

https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-statistics#statistics-in-serverless-sql-pool

I hope this information helps you. Let me know if you have any further questions or concerns.

Vinod Kumar Kapa 0 Reputation points

2024-04-24T13:39:34.13+00:00

Hi @Harishga

Thank you for your confirmation.

Based on your confirmation, the public preview entails utilizing the Lakehouse database to autonomously replicate tables in a serverless SQL pool using delta. Could you kindly confirm whether we can directly use the delta format in the serverless SQL pool when creating external tables and views? if yes what are the best methods to improve the performance of the delta format files in server less sql pool.
Harishga 6,005 Reputation points Microsoft External Staff

2024-04-24T14:47:56.94+00:00

Hi @Vinod Kumar Kapa

Yes, you can directly use the Delta format in the serverless SQL pool when creating external tables and views. To improve the performance of Delta format files in the serverless SQL pool, you can consider the following methods:

Partitioning the data can help reduce the amount of data scanned during queries, which can improve query performance. You can partition the data based on columns that are frequently used in queries.

Clustering the data can help group similar data together, which can improve query performance. You can cluster the data based on columns that are frequently used in queries.

You can use the cache feature in the serverless SQL pool to cache frequently accessed data, which can improve query performance.

You can create indexes on columns that are frequently used in queries to improve query performance.

Reference
https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/query-delta-lake-format
I hope this information helps you. Let me know if you have any further questions or concerns.
Vinod Kumar Kapa 0 Reputation points

2024-04-25T12:55:15.8733333+00:00

Hi @Harishga

Thank you for your confirmation.

I've confirmed that features like Clustering, cache, and indexes are not accessible in a serverless SQL pool.

Share via

Reducing Data Scanning Overhead in Delta Format in Azure Synapse Analytics SQL Pool

1 answer

Your answer