Azure pipeline optimization: cluster setup time in Azure Synapse/Data Factory

Shreyash Choudhary 126 Reputation points
2023-02-16T06:19:19.7166667+00:00

Any ideas/thoughts on how to decrease cluster setup time in Azure Synapse/Data Factory?

Currently it's taking almost 3 minutes on average for every mapping data flow activity in the pipeline (all data flows have dependencies, so they run sequentially via a schedule trigger, and between two data flows I have one copy activity).

Any ideas/thoughts on optimizing the Azure pipeline would be helpful. Thanks in advance.

Tags: Azure Synapse Analytics, Azure Data Factory

1 answer

  1. Bhargava-MSFT (Microsoft Employee, Moderator)
    2023-02-27T20:49:29.3266667+00:00

    Hello @Shreyash Choudhary,

    Welcome to the MS Q&A platform.

    Please correct me if my understanding is wrong: you want to know how to reduce the cluster start-up time for ADF/Synapse data flows.

    You can decrease the cluster start-up time by using the time to live (TTL) feature.

    Cluster start-up time is the time it takes to spin up an Apache Spark cluster. This value is located in the top-right corner of the monitoring screen. Data flows run on a just-in-time model where each job uses an isolated cluster. This start-up time generally takes 3-5 minutes. For sequential jobs, this can be reduced by enabling a time to live value.

    For more information, refer to the Time to live section in Integration Runtime performance.
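    As a rough illustration of the TTL setting, here is a sketch of a custom Azure Integration Runtime definition with `dataFlowProperties` enabling a warm cluster. The property names follow the managed-IR JSON schema, but the IR name, core count, and 10-minute TTL are placeholder values, not from this thread:

    ```python
    import json

    # Hypothetical managed Integration Runtime definition with a data flow TTL.
    # Values (name "DataFlowIR", 8 cores, 10-minute TTL) are illustrative only.
    ir_definition = {
        "name": "DataFlowIR",
        "properties": {
            "type": "Managed",
            "typeProperties": {
                "computeProperties": {
                    "location": "AutoResolve",
                    "dataFlowProperties": {
                        "computeType": "General",
                        "coreCount": 8,
                        "timeToLive": 10,  # minutes the Spark cluster stays warm
                        "cleanup": False   # keep the cluster for quick re-use
                    }
                }
            }
        }
    }

    print(json.dumps(ir_definition, indent=2))
    ```

    With a TTL in place, sequential data flow activities that use this runtime can reuse the warm cluster instead of paying the 3–5 minute cold start on every activity.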

    You can also optimize the performance of your data flows by using the Optimize tab in the data flow transformations. The Optimize tab contains settings to configure the partitioning scheme of the Spark cluster. Adjusting the partitioning provides control over the distribution of your data across compute nodes and data locality optimizations that can affect your overall data flow performance.
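    The idea behind the partitioning options can be sketched outside of Spark: a hash partition scheme assigns each row to one of N partitions by hashing a key column, which controls how evenly work is spread across compute nodes. This is a conceptual sketch of the technique, not ADF's or Spark's actual implementation:

    ```python
    from collections import defaultdict
    import zlib

    def hash_partition(rows, key, num_partitions):
        """Assign each row to a partition by hashing its key column.

        Conceptual sketch of a 'Hash' partition option: rows with the same
        key always land in the same partition, and a reasonable hash spreads
        distinct keys roughly evenly across partitions.
        """
        partitions = defaultdict(list)
        for row in rows:
            # zlib.crc32 gives a stable hash across runs (unlike hash()).
            p = zlib.crc32(str(row[key]).encode()) % num_partitions
            partitions[p].append(row)
        return partitions

    rows = [{"customer_id": i, "amount": i * 10} for i in range(1000)]
    parts = hash_partition(rows, "customer_id", 4)
    print({p: len(v) for p, v in sorted(parts.items())})
    ```

    A badly chosen key (few distinct values) produces skewed partitions, leaving most nodes idle, which is why the Optimize tab lets you pick the scheme and key per transformation.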

    If you do not need every pipeline execution of your data flow activities to fully log verbose telemetry, you can set the logging level to "Basic" or "None". In "Verbose" mode (the default), the service logs activity at each individual partition level during the data transformation. This is expensive, so enabling verbose logging only while troubleshooting can improve your overall data flow and pipeline performance.
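    As an illustration, the logging level surfaces in the Execute Data Flow activity's JSON. The fragment below is an assumption based on the pipeline code view (the `traceLevel` property name/values and the activity and data flow names are placeholders); verify against your own pipeline's JSON:

    ```python
    # Hypothetical Execute Data Flow activity fragment with reduced logging.
    # "traceLevel" is assumed to be the JSON counterpart of the UI logging
    # level; confirm in your pipeline's code view before relying on it.
    activity = {
        "name": "TransformOrders",  # placeholder activity name
        "type": "ExecuteDataFlow",
        "typeProperties": {
            "dataFlow": {
                "referenceName": "OrdersDataFlow",  # placeholder data flow
                "type": "DataFlowReference"
            },
            "traceLevel": "None"  # skip per-partition verbose logging
        }
    }

    print(activity["typeProperties"]["traceLevel"])
    ```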

    Reference document:

    https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/data-factory/concepts-data-flow-performance.md

    I hope this helps. Please let me know if you have any further questions.

