Synapse Serverless SQL pool, first query takes longer

Kemla Mukul 20 Reputation points
2024-04-15T07:04:40.6433333+00:00

While attempting to run a synapse query on an OPENROWSET with format "Delta", the first query takes 15-40 seconds extra compared to every other query run after this irrespective of the filepath the query is run on. Same behaviour is observed after about 5 minutes of having run the first query. Could you explain why this extra time is observed once in every 5 minutes, and if this time can be increased somehow. Attaching a dummy query of the same format.

select * from OPENROWSET(BULK ('abfss://dummyStorage@dummyStorage.dfs.core.windows.net/path/to/file'), FORMAT = 'delta') as pwi where pwi.x='y' order by pwi.orderingField offset 0 rows fetch next 1 rows only FOR JSON PATH, INCLUDE_NULL_VALUES

We've got a serverless SQL pool.

Azure Synapse Analytics
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
4,382 questions
{count} votes

Accepted answer
  1. Amira Bedhiafi 15,446 Reputation points
    2024-04-15T10:19:34.3566667+00:00

    Let's understand that serverless architectures, like the one used in Azure Synapse Serverless SQL pools, do not have dedicated resources persistently running. Instead, resources are allocated dynamically on demand. When a query is executed after some inactivity, the system needs to allocate and start-up resources, including computing and memory. This initialization process, often referred to as a "cold start", can add overhead to the first query execution.

    Since your query accesses data stored in a Delta format on Azure Data Lake, the first query also has additional overhead due to the initialization of connections to the Data Lake and metadata retrieval necessary to interpret the Delta files. Delta Lake maintains transaction logs that track changes to the dataset, and accessing these logs to construct a consistent view of the data for the first time might be time-consuming.

    Remember also that the Azure Synapse Serverless SQL pool likely caches metadata of the files and datasets it accesses. After executing the first query, metadata such as file locations, schema information, and statistics might be cached, thereby speeding up subsequent queries. If the system is idle for a prolonged period (around 5 minutes in your case), these caches might be cleared, and thus, the initialization and metadata retrieval processes need to occur again with the next query.

    Regarding increasing the time before a "cold start" occurs again, this capability is generally controlled by Azure and depends on its internal management and optimization algorithms. Serverless platforms prioritize efficient resource utilization, and keeping resources idle can be costly. However, Azure does not typically provide direct user control over how long resources or caches are retained idle in serverless environments.

    If you'd like to avoid these delays, you might consider using a provisioned SQL pool within Azure Synapse, where resources are dedicated and always running. This comes at the cost of losing the pay-per-use flexibility of serverless but provides predictable performance.

    0 comments No comments

0 additional answers

Sort by: Most helpful