Azure - Reduce Data Flow Activity Time

D B 21 Reputation points
2020-06-22T15:06:58.83+00:00

Hi there

I need to find a way to reduce the time my pipeline takes. I have a simple pipeline with two data flow activities, and it takes a long time even though it processes only one record:

1) Source to Staging: Overall time 5m 10 sec - Processing time 11s 803ms

2) Staging to DWH: Overall time 5m 2 sec - Processing time 2s 757ms

I read that this is because every Data Flow requires 5-7 minutes of cluster startup time, and that the way to reduce it is to modify the TTL of the Azure IR.

These are my questions. If the Azure IR is modified, how will this affect pipelines that have only one Data Flow activity? Are they going to see any decrease in execution time? In my example pipeline, it is possible to replace the second Data Flow with a Stored Procedure; if we do this, what would happen to the execution time of the entire pipeline? And finally, what is the cost of modifying the TTL?

I do not have permission to set these values or modify the ETL, which is why I'm asking. I would like to be sure about this before making any proposal.

Regards,

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

Accepted answer
  1. KranthiPakala-MSFT 46,442 Reputation points Microsoft Employee
    2020-06-22T21:05:53.183+00:00

    Hi DB-9790,

    Welcome to Microsoft Q&A platform and thanks for your query.

    1. If you leave the TTL at 0, ADF will always spawn a new Spark cluster environment for every Data Flow activity that executes. This means that an Azure Databricks cluster is provisioned each time, and it takes about 4-5 minutes to become available and execute your job.
    2. If you set a TTL, then the minimum billing time will be that amount of time. ADF will maintain that pool for the TTL time after the last data flow pipeline activity executes. Note that this will extend your billing period for a data flow to the extended time of your TTL.
    3. If you have a pipeline with a single data flow activity, it is better to use an Azure IR without TTL, since it will be billed only for the time to acquire compute + job execution time. If you set the TTL and use it only for a single data flow activity, then billing = time to acquire warm pool + job execution time + TTL time after the last data flow pipeline activity executes.
    4. The TTL setting is helpful when you have a pipeline with sequential data flow executions, because it allows you to stand up a pool of cluster compute resources for your factory. With this pool, you can sequentially submit data flow activities for execution. The initial set-up of the resource pool takes around 5 minutes. Once the pool is established, each subsequent job takes 1-2 minutes for the on-demand Spark cluster to execute it (i.e., ~5 min + 2 min + 2 min + 2 min + ...). If the TTL is not set, each subsequent job also takes ~5 min (i.e., ~5 min + ~5 min + ~5 min + ...).
    5. I would recommend having two different Azure IRs (one with no TTL and the other with a TTL set):
      a. For pipelines with single data flow activity - Use Azure IR without TTL
      b. For pipelines with sequential data flow activities - Use Azure IR with TTL set.
    6. For Data Flow execution pricing, please refer to the docs below:
      a. ADF Data flow execution pricing
      b. Understanding Data Factory pricing through examples
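
    As a rough illustration of the arithmetic in point 4, the startup overhead for a run of sequential data flows can be sketched as below. This is only a back-of-the-envelope model: the 5-minute and 2-minute constants are the approximate figures quoted above (using the upper end of the 1-2 minute range), and the function name `startup_minutes` is made up for this sketch rather than taken from any ADF API.

```python
# Back-of-the-envelope startup overhead for sequential data flow activities.
# The constants are the approximate figures from the answer, not measurements.

COLD_START_MIN = 5.0  # ~5 min to acquire a new cluster (assumed from the answer)
WARM_START_MIN = 2.0  # ~1-2 min per job on a warm pool (upper bound assumed)


def startup_minutes(num_data_flows: int, ttl_enabled: bool) -> float:
    """Total startup overhead in minutes, excluding actual processing time."""
    if num_data_flows <= 0:
        return 0.0
    if not ttl_enabled:
        # Every activity pays the full cold start.
        return num_data_flows * COLD_START_MIN
    # Only the first activity pays the cold start; the rest use the warm pool.
    return COLD_START_MIN + (num_data_flows - 1) * WARM_START_MIN


if __name__ == "__main__":
    # The pipeline from the question has two data flow activities.
    print("without TTL:", startup_minutes(2, ttl_enabled=False), "min")
    print("with TTL:   ", startup_minutes(2, ttl_enabled=True), "min")
```

    Under these assumed figures, a two-activity pipeline like the one in the question would spend roughly 10 minutes on startup without a TTL and roughly 7 minutes with one. A single-activity pipeline sees no difference in startup time, which is consistent with the recommendation in point 5a.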

    Hope this info helps. Do let us know if you have any further queries.

    ----------

    Thank you

    3 people found this answer helpful.

2 additional answers

  1. Kiran-MSFT 691 Reputation points Microsoft Employee
    2021-04-20T05:04:33.84+00:00

    Any data flow that uses the same IR, whether it is in the same pipeline or another one, will reuse the cluster if the cluster is NOT in use (that is, idling on TTL). If the IR is already running another data flow, it will spin up another cluster.
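
    A toy model of that reuse rule, for illustration only: it assumes a cluster can be reused exactly when it has finished its previous data flow and is still inside its TTL window, and it ignores startup time entirely. The function `count_cold_starts` and its inputs are hypothetical and do not correspond to any ADF API.

```python
# Toy model of cluster reuse on a single Azure IR.
# Assumption: a cluster is reusable only while it is idle and within its TTL.


def count_cold_starts(runs, ttl_minutes):
    """Count how many runs need a brand-new cluster.

    runs: list of (start_minute, duration_minutes), sorted by start time,
          all using the same IR.
    """
    idle_since = []  # for each cluster, the minute at which it becomes idle
    cold_starts = 0
    for start, duration in runs:
        for i, free_at in enumerate(idle_since):
            # Reuse a cluster that has finished and has not yet hit its TTL.
            if free_at <= start <= free_at + ttl_minutes:
                idle_since[i] = start + duration
                break
        else:
            # No idle cluster available: the IR spins up another one.
            cold_starts += 1
            idle_since.append(start + duration)
    return cold_starts


if __name__ == "__main__":
    sequential = [(0, 3), (4, 3)]   # second run starts after the first ends
    overlapping = [(0, 5), (2, 5)]  # second run starts while the first is busy
    print(count_cold_starts(sequential, ttl_minutes=10))
    print(count_cold_starts(overlapping, ttl_minutes=10))
```

    In this model, sequential runs inside the TTL window share one cluster, while overlapping runs on the same IR each get their own, which matches the behaviour described above.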

    1 person found this answer helpful.

  2. Al 21 Reputation points
    2021-11-01T08:49:03.777+00:00

    Kiran/Kranthi,

    Do you know anything about the 'Quick Re-use' functionality being available on Synapse? We (still) have to wait 1-2 minutes just for the next data flow to be kicked off, and that's WITH the TTL setting enabled. You could say it's an improvement over the approx. 5-6 minutes needed to spin up a new cluster, but it's still not the best outcome given that we have hundreds of data flows that need to be executed in sequence.

    Mark Kromer, in one of his posts (https://techcommunity.microsoft.com/t5/azure-data-factory/how-to-startup-your-data-flows-execution-in-less-than-5-seconds/ba-p/2267365), referred to this functionality in the ADF context. What's the latest and greatest on this, and is there a roadmap to make it available on Synapse?

    Thanks
    Alex