Performance of Filtering on Non-Partition Columns in Azure Serverless SQL

Kyle1245 420 Reputation points
2023-12-13T19:50:54.2733333+00:00

I understand the performance benefits of filtering data using filepath(n) or employing wildcards in the file path.

However, I'm interested in understanding the performance implications when querying data and filtering based on columns other than the partition columns. How much slower could this process be compared to filtering on partition columns?

My assumption is that the engine or Synapse must scan all files in all folders. Considering that a serverless dataset lacks indexes, I also anticipate that the process would be considerably slower in comparison.

To enhance performance, would it be advisable to adopt the practice of organizing files into multiple folders, perhaps forming a 'long' path like pc1/.../pcN/npc1/.../npcM, where each npc* folder represents a non-partition column later used for filtering?

I would greatly appreciate any insights or specific references on this matter.

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

Accepted answer
  1. phemanth 15,235 Reputation points Microsoft External Staff
    2023-12-14T09:24:50.8633333+00:00

    @Kyle1245

    Thank you for using the Microsoft Q&A platform and for posting your query.

    Consider the queries you'll be running most frequently and partition your data accordingly. If there are specific non-partition columns you filter on regularly, you may want to restructure your data so those columns become part of the partitioning scheme.

    Organizing files into many subfolders based on non-partition columns (as you suggested with a "long" path pc1/.../pcN/npc1/.../npcM) could potentially improve performance for specific queries. However, this approach has limitations: a deep folder hierarchy can become complex to manage, and it may create more partitions than necessary, which might actually hinder performance in some cases (for example, by producing many small files). You'd have to carefully balance the granularity of the partitions against your query patterns and data volume.
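    As an illustration (the storage account, container, and folder names below are hypothetical), partition elimination in serverless SQL is driven by wildcards in the BULK path combined with the filepath(n) function, which returns the value matched by the n-th wildcard. A sketch:

    ```sql
    -- Hypothetical layout: .../sales/year=2023/month=12/*.parquet
    SELECT COUNT(*) AS row_count
    FROM OPENROWSET(
        BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/sales/year=*/month=*/*.parquet',
        FORMAT = 'PARQUET'
    ) AS r
    WHERE r.filepath(1) = '2023'   -- value matched by the first wildcard (year)
      AND r.filepath(2) = '12';    -- value matched by the second wildcard (month)
    ```

    The filepath() predicates let the engine open only the files under year=2023/month=12. A filter on an ordinary column instead (e.g. WHERE customer_id = 42) would require reading every matching file.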

    Also, in serverless SQL your cost is based on the amount of data read, so you will definitely want to reduce the data being read by avoiding full scans as much as possible, ensuring that only the part of your data relevant to the query is read.

    Full scans are very costly in terms of performance, but the actual cost will depend on how much data is required to be read.

    Edited to add more detail regarding joins.

    Parquet is a columnar storage format. This means that reading a specific column's data across many rows is faster than reading many columns for a specific row. If your JOIN operation predominantly filters or scans specific columns, Parquet's structure will inherently provide some performance benefits.
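    Because Parquet is columnar, projecting only the columns you need reduces the bytes scanned (and therefore cost), even when no partition filter applies. A minimal sketch, using a hypothetical path:

    ```sql
    -- Only the two projected columns are read from the Parquet files;
    -- the remaining columns in each row group are skipped entirely.
    SELECT customer_id, amount
    FROM OPENROWSET(
        BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/sales/*.parquet',
        FORMAT = 'PARQUET'
    ) AS r;
    ```

    Avoiding SELECT * is one of the simplest ways to cut scanned data in serverless SQL.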

    When data is partitioned, the query engine can skip entire partitions that are irrelevant to a query. This is particularly beneficial for filter operations, not JOINs specifically. Partitioning on columns that you're joining on can improve performance if it allows the query engine to skip large amounts of irrelevant data.
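    One common pattern (paths and column names below are hypothetical) is to apply the filepath() filter to the partitioned side before joining, so the engine skips irrelevant partitions on that input:

    ```sql
    -- Join a partitioned fact set to a small dimension set;
    -- the filepath() predicate prunes the orders side to the 2023 folders.
    SELECT f.order_id, d.customer_name
    FROM OPENROWSET(
        BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/orders/year=*/*.parquet',
        FORMAT = 'PARQUET'
    ) AS f
    JOIN OPENROWSET(
        BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/customers/*.parquet',
        FORMAT = 'PARQUET'
    ) AS d
      ON f.customer_id = d.customer_id
    WHERE f.filepath(1) = '2023';
    ```

    The join itself still compares rows on customer_id; the partition filter only reduces how much of the orders data is read before the join runs.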

    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, do click Accept Answer and Yes for "was this answer helpful". And, if you have any further query, do let us know.



0 additional answers
