Dataflow times out

arkiboys 9,706 Reputation points
2022-11-19T22:17:38.647+00:00

Hello,
My pipeline sometimes times out.
The orchestration is as follows:

There are 10 pipelines in the data factory.
Each pipeline has a number of activities, e.g. copy, dataflow, etc.
There is one dataflow named df_audit.
Before and after each activity, I run df_audit, which writes details to the datalake in parquet format, such as:
pipeline name, activity name, and some other parameters...
Inside df_audit, there are 5 transformations:
1- source --> this uses the dataset ds_sourceAudit, which points to a dummy.csv file with one dummy column name and one dummy row value, i.e. cell A1 is dummyCol, cell A2 is dummyRow.
ds_sourceAudit points to dummy.csv and everything else in this dataset is set to default.

2- add column transform
3- select transform
4- alter row --> insert if true()
5- sink --> inline delta dataset pointing to a folder in a container in the datalake
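
For readers who want the shape of df_audit at a glance, the five steps above can be sketched in plain Python. This is only an analogy, not data flow script: the column names `pipeline_name` and `activity_name`, the example values, and the in-memory list standing in for the delta folder are all assumptions made for illustration.

```python
# Illustrative analogy only: plain Python, not Azure data flow script.
def build_audit_row(dummy_row, pipeline_name, activity_name):
    # 1- source: one dummy row read from dummy.csv
    row = dict(dummy_row)
    # 2- add column: attach the audit details (hypothetical column names)
    row["pipeline_name"] = pipeline_name
    row["activity_name"] = activity_name
    # 3- select: keep only the audit columns, dropping the dummy column
    selected = {key: row[key] for key in ("pipeline_name", "activity_name")}
    # 4- alter row: the condition true() marks every row as an insert
    return {"action": "insert", "values": selected}


def write_to_sink(sink, marked_row):
    # 5- sink: rows marked as inserts are appended to the target
    if marked_row["action"] == "insert":
        sink.append(marked_row["values"])


sink = []  # stands in for the delta folder in the datalake
marked = build_audit_row({"dummyCol": "dummyRow"}, "pl_example", "copy_example")
write_to_sink(sink, marked)
print(sink)
```

The point of the sketch is that the work per run is a single row, so the transformations themselves should be cheap.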

Tests carried out:
In debug, I can see that each run of df_audit takes 4 minutes on average, but it sometimes times out because the default timeout setting is 10 minutes.
I also tried running different pipelines in parallel, including a pipeline that contains only the df_audit dataflow. The pipeline with only the dataflow completes quickly, but the other pipelines, which have several activities, are delayed on df_audit for around 4 minutes.
Do you see what causes df_audit to time out sometimes?
In the debug glasses window, df_audit shows as queued for a long time, most of the time...

Here is a screenshot of the ds_sourceAudit dataset used by df_audit:

262153-image.png

Thank you

Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

1 answer

  1. KranthiPakala-MSFT 46,737 Reputation points Microsoft Employee Moderator
    2022-11-21T19:16:33.727+00:00

    Hello @arkiboys ,

    Thanks for the question and for using the MS Q&A platform.

    As per my understanding, you have a dataflow that points to a dummy CSV file. When you use that dataflow in a separate pipeline (no other activities except the data flow), it goes through without any issues, but in a different pipeline that contains other activities along with this dataflow activity, it sometimes fails and sometimes goes through fine. Please correct me if my understanding is not accurate.

    Since you are pretty sure that the timeout is happening at the dataflow activity, have you had a chance to verify where exactly the timeout occurs? Is it taking more time to spin up the compute, or is it taking more time to execute the transformations within your dataflow activity?
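
    As a rough way to answer that question from timestamps you note down for a run, the split between waiting and executing can be sketched as below. This is a minimal sketch, not an Azure Data Factory API: the function name, the three timestamps, and the example values are hypothetical, and the 10-minute timeout is the figure mentioned in the question.

    ```python
    from datetime import datetime


    def split_run_time(queued_at, started_at, finished_at, timeout_minutes=10.0):
        """Split one activity run into waiting and executing time, in minutes."""
        waiting = (started_at - queued_at).total_seconds() / 60
        executing = (finished_at - started_at).total_seconds() / 60
        total = waiting + executing
        return {
            "waiting_min": waiting,
            "executing_min": executing,
            "total_min": total,
            # Which phase accounts for most of the run
            "dominant": "waiting" if waiting >= executing else "executing",
            "exceeds_timeout": total > timeout_minutes,
        }


    # Hypothetical timestamps noted manually for one run
    run = split_run_time(
        queued_at=datetime(2022, 11, 19, 22, 0, 0),
        started_at=datetime(2022, 11, 19, 22, 3, 30),
        finished_at=datetime(2022, 11, 19, 22, 4, 0),
    )
    print(run)
    ```

    If waiting dominates across several runs, the delay is more likely in getting compute than in the transformations themselves.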

    I would recommend debugging the pipeline end to end, step by step, to identify where exactly the timeout is happening. Please refer to this troubleshooting guide and see if it helps to identify the root cause and solution: Troubleshoot mapping data flows in Azure Data Factory

    If you are sure that the dataflow is causing the timeout, then please follow this doc to improve the performance of the dataflow: Mapping data flows performance and tuning guide

    Hope this info helps.
