Synapse Analytics: Querying Performance with Serverless SQL Pool and Partitioning

Question

Synapse Analytics: Querying Performance with Serverless SQL Pool and Partitioning

Ross Lyons 0

Hi,

I have followed the serverless SQL pool best practices documentation when setting up my Synapse Analytics workspace and data lake, particularly I am wondering about the section regarding querying specific files using file metadata, and the next step of querying partitioned parquet files.

Below is a slightly modified example taken from the previously linked document. Having adapted this to my own data and partitioning setup, I was expecting the number of partitions to have little impact on the performance of each query.

SELECT
        YEAR(tpepPickupDateTime),
        passengerCount,
        COUNT(*) AS cnt
FROM  
    OPENROWSET(
        BULK 'puYear=*/puMonth=*/*.snappy.parquet',
        DATA_SOURCE = 'YellowTaxi',
        FORMAT='PARQUET'
    ) nyc
WHERE
    nyc.filepath(1) = 2017
    AND nyc.filepath(2) IN (1, 2, 3)

However, as the number of partitions grows, the performance takes a massive hit, for example with only a few partitions each query takes less than 1s, however with > 30 partitions I am seeing queries exceed 10s.

When I apply the filters directly in the BULK string to target the specific partitions, the query performance consistently remains around 1 second, regardless of the number of partitions.

Is this expected behaviour when using the WHERE clause? Is there any guidance on achieving the results similar to when hard coding values in the BULK string?

For additional context, queries like the above are used to create SQL views that are used by external applications.

Thanks!

Venkat Reddy Navari 2,885 Reputation points Microsoft External Staff Moderator

2025-06-20T06:41:10.8866667+00:00

@Ross Lyons Following up to see if the below answer was helpful. If this answers your query, do click Accept Answer and Yes for was this answer helpful. And, if you have any further query do let us know.

1 answer

Your answer

Venkat Reddy Navari 2,885 Reputation points Microsoft External Staff Moderator

2025-06-20T06:41:10.8866667+00:00

@Ross Lyons Following up to see if the below answer was helpful. If this answers your query, do click Accept Answer and Yes for was this answer helpful. And, if you have any further query do let us know.

Answer 1

Hi @Ross Lyons

Yes, this is expected behavior in Synapse Serverless SQL Pools.

Why you’re Seeing Performance Differences:

When you apply filters like nyc.filepath(1) = 2017 in the WHERE clause, Synapse first enumerates all matching files in the given BULK path (i.e., 'puYear=*/puMonth=*/*.parquet'), then filters the records that match your partition values — this still incurs the cost of scanning metadata for many partitions, even if no data is read from some files.

In contrast, embedding partition values directly into the BULK path (e.g., 'puYear=2017/puMonth=1/*.parquet') allows Synapse to skip file listing and metadata scans, resulting in much faster query performance.

Best Practices to Improve Performance:

Use Explicit Paths in the BULK Clause For optimal performance, construct dynamic SQL to build partition-specific BULK paths instead of relying on WHERE filtering. Example:
```
  SELECT *
  FROM OPENROWSET(
    BULK 'puYear=2017/puMonth=1/*.parquet',
    DATA_SOURCE = 'YellowTaxi',
    FORMAT='PARQUET'
  ) AS nyc
```
Avoid Wildcards in Production Queries/Views Wildcards (*) in folder structure force Synapse to list and scan more files, even if filtered later.
Create External Tables with Partition Elimination Consider using CREATE EXTERNAL TABLE with partitioned folders, then design your WHERE clauses to align with the table's partition columns. This enables partition pruning.
Materialize Complex Queries For frequent queries or BI apps, consider staging aggregated data or pre-joining heavy views as intermediate tables.

Hope this helps. Do let us know if you any further queries.

If this answers your query, do click Accept Answer and Yes for was this answer helpful. And, if you have any further query do let us know.

Share via

Synapse Analytics: Querying Performance with Serverless SQL Pool and Partitioning

1 answer

Your answer