Hi @Ross Lyons
Yes, this is expected behavior in Synapse Serverless SQL Pools.
Why you’re Seeing Performance Differences:
When you apply filters like nyc.filepath(1) = 2017
in the WHERE
clause, Synapse first enumerates all matching files in the given BULK
path (i.e., 'puYear=*/puMonth=*/*.parquet'
), then filters the records that match your partition values — this still incurs the cost of scanning metadata for many partitions, even if no data is read from some files.
In contrast, embedding partition values directly into the BULK
path (e.g., 'puYear=2017/puMonth=1/*.parquet'
) allows Synapse to skip file listing and metadata scans, resulting in much faster query performance.
Best Practices to Improve Performance:
- Use Explicit Paths in the BULK Clause For optimal performance, construct dynamic SQL to build partition-specific BULK paths instead of relying on
WHERE
filtering. Example:SELECT * FROM OPENROWSET( BULK 'puYear=2017/puMonth=1/*.parquet', DATA_SOURCE = 'YellowTaxi', FORMAT='PARQUET' ) AS nyc
- Avoid Wildcards in Production Queries/Views Wildcards (
*
) in folder structure force Synapse to list and scan more files, even if filtered later. - Create External Tables with Partition Elimination Consider using
CREATE EXTERNAL TABLE
with partitioned folders, then design yourWHERE
clauses to align with the table's partition columns. This enables partition pruning. - Materialize Complex Queries For frequent queries or BI apps, consider staging aggregated data or pre-joining heavy views as intermediate tables.
Hope this helps. Do let us know if you any further queries.
If this answers your query, do click Accept Answer
and Yes
for was this answer helpful. And, if you have any further query do let us know.