Discrepancy in ADF Data flow Activity Execution Time and Sink Processing Time in ADF Jobs

Amit Kumar 0 Reputation points
2023-11-01T20:46:52.49+00:00

I've been observing a significant gap between the execution duration of my Azure Data Factory (ADF) data flow activities and the time the sink processing actually takes, even when the cluster startup time is included. In a recent example, the data flow activity started at 11:00:07 and its recorded duration was 1 minute 28 seconds, which gives an end time of 11:01:35. However, the data flow monitoring details show that sink processing completed in 1 s 301 ms, with a cluster startup time of 1 s 263 ms.
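
For reference, the arithmetic behind these figures can be sketched in Python. This is only a minimal sketch: the function `summarize_gap` and its field names are illustrative and not part of ADF, and the times are treated as offsets from midnight because the run date is not stated.

```python
from datetime import timedelta


def summarize_gap(start, duration, sink_time, cluster_startup):
    """Compare the activity duration with the time the monitoring view accounts for."""
    accounted = sink_time + cluster_startup
    return {
        "end": start + duration,                  # implied activity end time
        "accounted": accounted,                   # sink processing + cluster startup
        "unaccounted": duration - accounted,      # remainder of the activity duration
        "accounted_share": accounted / duration,  # fraction of the duration explained
    }


# Figures quoted in the question, expressed as offsets from midnight.
summary = summarize_gap(
    start=timedelta(hours=11, minutes=0, seconds=7),
    duration=timedelta(minutes=1, seconds=28),
    sink_time=timedelta(seconds=1, milliseconds=301),
    cluster_startup=timedelta(seconds=1, milliseconds=263),
)

print("Activity end time:", summary["end"])
print("Accounted for (s):", summary["accounted"].total_seconds())
print("Unaccounted (s):", summary["unaccounted"].total_seconds())
print(f"Accounted share: {summary['accounted_share']:.1%}")
```

With these inputs, sink processing and cluster startup together explain only about 2.6 seconds of the 88-second activity, which leaves roughly 85 seconds that the data flow details do not account for.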

The snapshots below show the data flow status as “Success” at 11:00:29, while the pipeline itself is still in progress.

[Two screenshots of the monitoring view; images not preserved]

Given this discrepancy, I would like to understand why the data flow activity remains in progress at the pipeline level even after the sink processing has completed. This matters because I am running multiple pipelines with data flows in my project, so the time differences add up substantially. I am using a managed VNet integration runtime (IR) with memory-optimised compute, 16 (+16 driver cores), and have set the Time To Live (TTL) to 30 minutes.

I would appreciate any insights or guidance on potential causes for this discrepancy and any recommendations on how to address it effectively.

Thanks,

Azure Data Factory

1 answer

  1. Konstantinos Passadis 19,166 Reputation points MVP
    2023-11-02T00:38:39.4033333+00:00

    Hello @amit kumar!

    Welcome to Microsoft QnA!

    When a data flow runs inside a pipeline, the activity involves several steps beyond the sink write itself:

    Preparation: parsing the data flow, resolving dependencies, and building the execution plan.

    Compute resources: acquiring and starting the required compute can take some time, especially with on-demand compute, which can have a significant start-up time.

    Execution: the actual run of the data flow, including reading the source data, applying the transformations, and finally writing to the sink.

    Resource shutdown: after execution, the cluster is shut down if its Time To Live (TTL) has expired or if it is not set to remain active.

    Post-processing: logging, updating ADF metadata, and other necessary clean-up tasks.

    The sink execution time shown in the data flow details is therefore only a fraction of the overall process.

    I suggest having a look at:

    https://learn.microsoft.com/en-us/azure/data-factory/concepts-integration-runtime-performance

    And this great article:

    https://mrpaulandrew.com/2019/12/18/best-practices-for-implementing-azure-data-factory/

    I hope this helps!

    Kindly mark the answer as Accepted and Upvote in case it helped!

    Regards
