Synapse Analytics: Querying Performance with Serverless SQL Pool and Partitioning

Ross Lyons 0 Reputation points
2025-06-18T11:59:34.7466667+00:00

Hi,

I have followed the serverless SQL pool best practices documentation when setting up my Synapse Analytics workspace and data lake, particularly I am wondering about the section regarding querying specific files using file metadata, and the next step of querying partitioned parquet files.

Below is a slightly modified example taken from the previously linked document. Having adapted this to my own data and partitioning setup, I was expecting the number of partitions to have little impact on the performance of each query.

SELECT
        YEAR(tpepPickupDateTime),
        passengerCount,
        COUNT(*) AS cnt
FROM  
    OPENROWSET(
        BULK 'puYear=*/puMonth=*/*.snappy.parquet',
        DATA_SOURCE = 'YellowTaxi',
        FORMAT='PARQUET'
    ) nyc
WHERE
    nyc.filepath(1) = 2017
    AND nyc.filepath(2) IN (1, 2, 3)

However, as the number of partitions grows, the performance takes a massive hit, for example with only a few partitions each query takes less than 1s, however with > 30 partitions I am seeing queries exceed 10s.

When I apply the filters directly in the BULK string to target the specific partitions, the query performance consistently remains around 1 second, regardless of the number of partitions.

Is this expected behaviour when using the WHERE clause? Is there any guidance on achieving the results similar to when hard coding values in the BULK string?

For additional context, queries like the above are used to create SQL views that are used by external applications.

Thanks!

Azure Synapse Analytics
Azure Synapse Analytics
An Azure analytics service that brings together data integration, enterprise data warehousing, and big data analytics. Previously known as Azure SQL Data Warehouse.
5,373 questions
{count} votes

1 answer

Sort by: Most helpful
  1. Smaran Thoomu 24,110 Reputation points Microsoft External Staff Moderator
    2025-06-19T02:28:43.45+00:00

    Hi @Ross Lyons

    Yes, this is expected behavior in Synapse Serverless SQL Pools.

    Why you’re Seeing Performance Differences:

    When you apply filters like nyc.filepath(1) = 2017 in the WHERE clause, Synapse first enumerates all matching files in the given BULK path (i.e., 'puYear=*/puMonth=*/*.parquet'), then filters the records that match your partition values — this still incurs the cost of scanning metadata for many partitions, even if no data is read from some files.

    In contrast, embedding partition values directly into the BULK path (e.g., 'puYear=2017/puMonth=1/*.parquet') allows Synapse to skip file listing and metadata scans, resulting in much faster query performance.

    Best Practices to Improve Performance:

    • Use Explicit Paths in the BULK Clause For optimal performance, construct dynamic SQL to build partition-specific BULK paths instead of relying on WHERE filtering. Example:
        SELECT *
        FROM OPENROWSET(
          BULK 'puYear=2017/puMonth=1/*.parquet',
          DATA_SOURCE = 'YellowTaxi',
          FORMAT='PARQUET'
        ) AS nyc
      
    • Avoid Wildcards in Production Queries/Views Wildcards (*) in folder structure force Synapse to list and scan more files, even if filtered later.
    • Create External Tables with Partition Elimination Consider using CREATE EXTERNAL TABLE with partitioned folders, then design your WHERE clauses to align with the table's partition columns. This enables partition pruning.
    • Materialize Complex Queries For frequent queries or BI apps, consider staging aggregated data or pre-joining heavy views as intermediate tables.

    Hope this helps. Do let us know if you any further queries.


    If this answers your query, do click Accept Answer and Yes for was this answer helpful. And, if you have any further query do let us know.

    0 comments No comments

Your answer

Answers can be marked as Accepted Answers by the question author, which helps users to know the answer solved the author's problem.