Right number of partitions for sinks and transformations

D B 21 Reputation points
2020-10-21T19:41:17.283+00:00

Hi there.

I’m trying to improve the performance of my Data Flow in Azure Data Factory. According to the official Microsoft documentation, one way to do this is to set the number of partitions on my sinks and transformations after analyzing the first execution and identifying the bottleneck. Some months ago, I read a guide on how to calculate the right number of partitions. I’m not certain, but I think it said to multiply the number of cores by a specific factor (I believe this was for Round Robin partitioning). Unfortunately, I cannot find this reference anymore.

What is the most accurate way to calculate this? I’m trying to do it dynamically based on the number of rows in the file, since its size can vary a lot between executions.

Regards

Azure Data Factory

1 answer

  1. MarkKromer-MSFT 5,211 Reputation points Microsoft Employee
    2020-10-22T06:23:43.513+00:00

    If your sink is file-based, the most performant setting is to leave the partitioning at the default, "current partitioning". This lets Spark determine the best number of partitions based on the number of cores in the worker nodes, so your data flow scales proportionally as you add cores to your integration runtime:

    https://techcommunity.microsoft.com/t5/azure-data-factory/use-azure-ir-to-tune-adf-and-synapse-data-flows/ba-p/1715269

    https://techcommunity.microsoft.com/t5/azure-data-factory/performance-tuning-adf-data-flow-sources-and-sinks/ba-p/1781804

    This is the latest updated ADF data flows performance guide: https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-performance

    I would only set the file sink partitioning manually if you want to control the partitioned file and folder structure (which adds processing time) or if you intentionally want to limit the number of partitions that the data flow can use.
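
If you do decide to set a partition count manually, the "cores times a factor" idea from the question can be sketched as below. This is only a rough illustration, not an ADF formula: the function name `estimate_partitions` and the parameters `tasks_per_core` and `target_rows_per_partition` are hypothetical, and their default values are assumptions to tune against your own runs. The factor of 2 loosely follows the general Spark tuning guidance of 2-3 tasks per CPU core; the row threshold is arbitrary.

```python
import math


def estimate_partitions(total_cores: int,
                        row_count: int,
                        tasks_per_core: int = 2,
                        target_rows_per_partition: int = 1_000_000) -> int:
    """Rough partition-count heuristic (illustrative only, not an ADF API).

    tasks_per_core and target_rows_per_partition are assumed defaults,
    not values taken from the ADF documentation.
    """
    if total_cores < 1 or tasks_per_core < 1 or target_rows_per_partition < 1:
        raise ValueError("cores, factor and rows per partition must be >= 1")
    if row_count < 0:
        raise ValueError("row_count must be >= 0")

    # Upper bound: a small multiple of the available cores.
    by_cores = total_cores * tasks_per_core
    # Partitions needed so that none exceeds the target row count.
    by_rows = math.ceil(row_count / target_rows_per_partition)
    # Small files get few partitions; large files are capped by the cores.
    return max(1, min(by_cores, by_rows))


print(estimate_partitions(total_cores=16, row_count=50_000_000))
print(estimate_partitions(total_cores=16, row_count=3_500_000))
```

With these assumed defaults, a large file is capped at cores times the factor, while a small file gets only as many partitions as its row count justifies. For file-based sinks, the advice above still applies: leaving the default in place is usually the faster option.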

