In spite of having an azure runtime allocated, each activity has a separate "AcquiringCompute" step?

Akshay Mahajan 21 Reputation points
2020-08-04T17:55:37.84+00:00

Question: In our pipeline, we have around 10 mapping data flow activities running serially (one after another). Each of them is configured to use the same integration runtime (an Azure managed runtime). In spite of this configuration, each activity shows a 3-4 minute "AcquiringCompute" step, and we are not sure why.

This would be understandable if we had used the auto integration runtime, but that is not the case. It would also be understandable if only the first activity in the pipeline showed this cluster startup time. Why does each activity show it? And what is the difference, then, between an auto integration runtime and an Azure runtime?

Impact: This makes my process run for 46 minutes instead of 16 minutes, which is a huge problem.

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

Accepted answer
  1. KranthiPakala-MSFT 46,502 Reputation points Microsoft Employee
    2020-08-04T21:27:44.423+00:00

    Hi @Akshay Mahajan ,

    Welcome to Microsoft Q&A platform and thanks for your query.

    Yes, you are correct: if you leave the TTL at 0, ADF always spawns a new Spark cluster environment for every Data Flow activity that executes. This means that an Azure Databricks cluster is provisioned each time, and it takes about 4 minutes to become available and execute your job.

    With the TTL feature, you incur the cluster start-up time only on the first data flow activity execution. After that, we keep the VMs available in a pool for the length of your TTL setting. Subsequent data flow activities still take about 1 minute to start, because each execution receives a new Spark context.

    For example:

    Without TTL: (Spin up 1 + Run 1) + (Spin up 2 + Run 2) + (Spin up 3 + Run 3) ...
                 (~4 min + job execution) + (~4 min + job execution) + (~4 min + job execution) ...

    With TTL: (Spin up a cluster + Run 1) + (Run 2) + (Run 3) ...
              (~4 min + job execution) + (~1 min + job execution) + (~1 min + job execution) ...
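
The arithmetic above can be sketched as a rough estimate in Python. This is only an illustration: the function name and its parameters are made up for the example, the ~4 minute and ~1 minute figures are the approximate values quoted above, and the TTL case assumes each activity starts before the TTL of the previous cluster expires.

```python
def total_startup_minutes(activities, cold_start=4.0, warm_start=1.0, ttl_enabled=False):
    """Estimate total compute-acquisition overhead for sequential data flow activities.

    cold_start: approximate minutes to spin up a new cluster (assumed ~4).
    warm_start: approximate minutes to get a new Spark context on a warm pool (assumed ~1).
    """
    if activities <= 0:
        return 0.0
    if not ttl_enabled:
        # Every activity pays the full cluster spin-up.
        return activities * cold_start
    # Only the first activity pays the spin-up; the rest pay the warm start.
    return cold_start + (activities - 1) * warm_start


n = 10  # number of sequential data flow activities, as in the question
without_ttl = total_startup_minutes(n)
with_ttl = total_startup_minutes(n, ttl_enabled=True)
print(f"Without TTL: ~{without_ttl:.0f} min of startup overhead")
print(f"With TTL:    ~{with_ttl:.0f} min of startup overhead")
print(f"Estimated saving: ~{without_ttl - with_ttl:.0f} min")
```

For 10 activities this gives roughly 40 minutes of overhead without TTL and roughly 13 minutes with it, excluding the job execution time itself.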

    And to answer your question: yes, if your Data Flow activities run sequentially, then using TTL is the appropriate solution.
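
As a minimal sketch of where the TTL lives, the snippet below builds an Azure integration runtime definition with a non-zero `timeToLive` under the data flow properties. The runtime name, core count, compute type, and the 10-minute TTL are placeholder assumptions, and the JSON layout reflects my understanding of the ADF integration runtime schema, so check it against the JSON view of your own runtime before relying on it.

```python
import json


def build_ir_definition(name, ttl_minutes, core_count=8, compute_type="General"):
    """Return an integration runtime definition as a dict (illustrative helper).

    All argument values are placeholders; ttl_minutes=0 corresponds to the
    "new cluster for every activity" behavior described above.
    """
    return {
        "name": name,
        "properties": {
            "type": "Managed",
            "typeProperties": {
                "computeProperties": {
                    "location": "AutoResolve",
                    "dataFlowProperties": {
                        "computeType": compute_type,
                        "coreCount": core_count,
                        "timeToLive": ttl_minutes,  # minutes to keep the cluster warm
                    },
                }
            },
        },
    }


# Hypothetical runtime name and a 10-minute TTL, chosen only for the example.
ir = build_ir_definition("DataFlowRuntimeWithTtl", ttl_minutes=10)
print(json.dumps(ir, indent=2))
```

Every data flow activity in the pipeline then needs to reference this same runtime, and the TTL should be long enough to cover the gap between one activity finishing and the next one starting.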

    Additional info:

    Please refer to below docs:

    Hope this helps.

    ----------

    Thank you
    Please consider clicking "Accept Answer" and "Upvote" on the post that helps you, as it can be beneficial to other community members.
