Azure pipeline optimization: cluster setup time in Azure Synapse/Data Factory

Shreyash Choudhary 126 Reputation points
2023-02-16T06:19:19.7166667+00:00

Any ideas/thoughts on how to decrease cluster setup time in Azure Synapse/Data Factory?

Currently it's taking almost 3 minutes on average for every mapping data flow activity in the pipeline (all data flows have dependencies, so they run sequentially via a schedule trigger, and between two data flows I have one copy activity).

Any ideas/thoughts on optimizing the Azure pipeline would be helpful. Thanks in advance.

Tags: Azure Synapse Analytics, Azure Data Factory

1 answer

  1. Bhargava-MSFT (Microsoft Employee, Moderator)
    2023-02-27T20:49:29.3266667+00:00

    Hello @Shreyash Choudhary,

    Welcome to the MS Q&A platform.

    Please correct me if my understanding is wrong: you want to know how to reduce the cluster start-up time for ADF/Synapse data flows.

    You can decrease the cluster start-up time by using the time to live (TTL) feature.

    Cluster start-up time is the time it takes to spin up an Apache Spark cluster. This value is located in the top-right corner of the monitoring screen. Data flows run on a just-in-time model where each job uses an isolated cluster. This start-up time generally takes 3-5 minutes. For sequential jobs, this can be reduced by enabling a time to live value.

    For more information, refer to the Time to live section in Integration Runtime performance.
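    As a rough illustration of the TTL setting, here is a sketch of a custom Azure Integration Runtime definition with `dataFlowProperties` enabling a warm cluster. The property names follow the managed-IR JSON schema, but the IR name, core count, and 10-minute TTL are placeholder values, not from this thread:

    ```python
    import json

    # Hypothetical managed Integration Runtime definition with a data flow TTL.
    # Values (name "DataFlowIR", 8 cores, 10-minute TTL) are illustrative only.
    ir_definition = {
        "name": "DataFlowIR",
        "properties": {
            "type": "Managed",
            "typeProperties": {
                "computeProperties": {
                    "location": "AutoResolve",
                    "dataFlowProperties": {
                        "computeType": "General",
                        "coreCount": 8,
                        "timeToLive": 10,  # minutes the Spark cluster stays warm
                        "cleanup": False   # keep the cluster for quick re-use
                    }
                }
            }
        }
    }

    print(json.dumps(ir_definition, indent=2))
    ```

    With a TTL in place, sequential data flow activities that use this runtime can reuse the warm cluster instead of paying the 3–5 minute cold start on every activity.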

    You can also optimize the performance of your data flows by using the Optimize tab in the data flow transformations. The Optimize tab contains settings to configure the partitioning scheme of the Spark cluster. Adjusting the partitioning provides control over the distribution of your data across compute nodes and data locality optimizations that can affect your overall data flow performance.
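    The idea behind the partitioning options can be sketched outside of Spark: a hash partition scheme assigns each row to one of N partitions by hashing a key column, which controls how evenly work is spread across compute nodes. This is a conceptual sketch of the technique, not ADF's or Spark's actual implementation:

    ```python
    from collections import defaultdict
    import zlib

    def hash_partition(rows, key, num_partitions):
        """Assign each row to a partition by hashing its key column.

        Conceptual sketch of a 'Hash' partition option: rows with the same
        key always land in the same partition, and a reasonable hash spreads
        distinct keys roughly evenly across partitions.
        """
        partitions = defaultdict(list)
        for row in rows:
            # zlib.crc32 gives a stable hash across runs (unlike hash()).
            p = zlib.crc32(str(row[key]).encode()) % num_partitions
            partitions[p].append(row)
        return partitions

    rows = [{"customer_id": i, "amount": i * 10} for i in range(1000)]
    parts = hash_partition(rows, "customer_id", 4)
    print({p: len(v) for p, v in sorted(parts.items())})
    ```

    A badly chosen key (few distinct values) produces skewed partitions, leaving most nodes idle, which is why the Optimize tab lets you pick the scheme and key per transformation.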

    If you do not need every pipeline execution of your data flow activities to fully log verbose telemetry, you can set the logging level to "Basic" or "None". In "Verbose" mode (the default), the service logs activity at each individual partition level during the data transformation. This is expensive, so enabling verbose logging only while troubleshooting can improve your overall data flow and pipeline performance.
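    As an illustration, the logging level surfaces in the Execute Data Flow activity's JSON. The fragment below is an assumption based on the pipeline code view (the `traceLevel` property name/values and the activity and data flow names are placeholders); verify against your own pipeline's JSON:

    ```python
    # Hypothetical Execute Data Flow activity fragment with reduced logging.
    # "traceLevel" is assumed to be the JSON counterpart of the UI logging
    # level; confirm in your pipeline's code view before relying on it.
    activity = {
        "name": "TransformOrders",  # placeholder activity name
        "type": "ExecuteDataFlow",
        "typeProperties": {
            "dataFlow": {
                "referenceName": "OrdersDataFlow",  # placeholder data flow
                "type": "DataFlowReference"
            },
            "traceLevel": "None"  # skip per-partition verbose logging
        }
    }

    print(activity["typeProperties"]["traceLevel"])
    ```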

    Reference document:

    https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/data-factory/concepts-data-flow-performance.md

    I hope this helps. Please let me know if you have any further questions.

