Reducing Data Scanning Overhead in Delta Format in Azure Synapse Analytics SQL Pool

Vinod Kumar Kapa 0 Reputation points
2024-04-23T20:04:03.4333333+00:00

We've established a serverless SQL pool with a hierarchical folder structure partition and employed open row set to reduce data scanning when executing queries on Parquet file format from the serverless SQL pool.

Now, as we endeavor to utilize the delta format for queries within the serverless SQL pool, we've encountered significant scanning overhead and multiple entries in SQL requests.

Is it expected behavior with having delta format in server less sql pool.

What are the possible solutions to get reduce the data scan.

Azure Synapse Analytics
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
5,378 questions
0 comments No comments
{count} votes

1 answer

Sort by: Most helpful
  1. Harishga 6,005 Reputation points Microsoft External Staff
    2024-04-24T05:28:18.5133333+00:00

    Hi @Vinod Kumar Kapa

    Welcome to Microsoft Q&A platform and thanks for posting your question here.

    According to the Azure documentation, querying Delta Lake format in serverless Synapse SQL pool is currently in public preview. This preview version is provided without a service level agreement, and it's not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. Therefore, it is possible to encounter significant scanning overhead and multiple entries in SQL requests when using the delta format for queries within a serverless SQL pool.

    To reduce the data scan, you can follow the best practices for serverless SQL pool provided by Azure. Here are some possible solutions:

    • The data types you use in your query affect performance and concurrency. Use the smallest data size that can accommodate the largest possible value. If possible, use varchar and char instead of nvarchar and nchar.  
    • Use PARSER_VERSION 2.0 to query Delta Lake files: You can use a performance-optimized parser when you query Delta Lake files.
    • Creating statistics for columns used in queries can improve query performance in Azure Synapse Analytics. The serverless SQL pool uses statistics to generate optimal query execution plans. While statistics are automatically created for some file types, they are not automatically created for Delta Lake files when using external tables. It's important to manually create statistics for Delta Lake files, especially for columns used in DISTINCT, JOIN, WHERE, ORDER BY, and GROUP BY clauses.
       
    • Optimizing the partition strategy in your data lake can improve query performance in Azure Synapse Analytics. To do this, you need to identify the file types in the data lake and the query performance SLA. You should also consider whether your use case demands predictable performance and cost, and whether you have unplanned or bursty SQL analytical workloads. It's important to identify the data consuming pattern and platforms. By doing so, you can ensure that you have a good partition strategy that meets your specific needs and improves query performance.
    • Using Spark pool assessment can help improve query performance in Azure Synapse Analytics. Spark pools enable data engineering and data preparation, and Apache Spark includes many optimization techniques that can improve query performance. By using Spark pool assessment, you can identify performance bottlenecks and optimize your Spark jobs to improve query performance.

    Reference
    https://learn.microsoft.com/en-us/azure/synapse-analytics/guidance/implementation-success-assess-environment

    https://docs.microsoft.com/en-us/azure/synapse-analytics/sql/develop-openrowset#parser_version

    https://learn.microsoft.com/en-us/azure/synapse-analytics/sql/develop-tables-statistics#statistics-in-serverless-sql-pool

    I hope this information helps you. Let me know if you have any further questions or concerns.


Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.