Azure DataFlow Issue

Rahul Ahuja 1 Reputation point
2022-11-25T12:41:24.69+00:00

Hi,

I am running a pipeline that contains only a Data Flow activity. The data flow simply reads data from JSON files and inserts it into Azure SQL. My source contains about 9 to 10 lakh (900,000 to 1,000,000) JSON files. When I ran the pipeline for the first time, it executed quickly and succeeded, and the data was inserted into the DB.

When I rerun the pipeline, the run does not complete; it keeps running for more than 40 minutes. No errors are thrown, and the run never finishes. When I open the debug view, it shows the activity is queued, and it stays in that state for a very long time.
It is queued at the first step (reading the files from Data Lake storage).

Why is this happening? What could be the issue?

We are unable to find the cause.

Please have a look at the points below:

  • In the last run, with about 9 lakh files, our DF took about 2.5 hours to finish, but the actual DF runtime was only 19 minutes and 27 seconds, so almost 2 hours were spent in the queued state.

264187-image.png
264263-image.png

  • Here is the IR configuration:
    264264-image.png

1 answer

  1. BhargavaGunnam-MSFT 25,976 Reputation points Microsoft Employee
    2022-11-30T23:06:49.58+00:00

    Hello @Rahul Ahuja ,
    I have discussed the issue with my internal team, and they have advised the following to resolve it:

    1) It seems the core count is not enough to process this volume of data. Please increase the core count of the data flow cluster (the Azure Integration Runtime used by the data flow); a sketch of the relevant settings follows this list.
    (or)
    2) Process the data in smaller chunks.
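
    For option 1), here is a hedged sketch of what the data flow compute settings on an Azure Integration Runtime can look like in JSON. The IR name, core count, and time-to-live below are illustrative assumptions rather than values from your factory, and property names/casing may differ slightly from what ADF Studio generates, so treat it as a starting point rather than a drop-in definition.

    ```json
    {
      "name": "DataFlowIR",
      "properties": {
        "description": "Azure IR used by the Execute Data Flow activity; core count raised so the Spark cluster has more capacity for the file listing/read phase.",
        "type": "Managed",
        "typeProperties": {
          "computeProperties": {
            "location": "AutoResolve",
            "dataFlowProperties": {
              "computeType": "General",
              "coreCount": 32,
              "timeToLive": 15
            }
          }
        }
      }
    }
    ```

    A non-zero timeToLive also lets back-to-back data flow runs reuse a warm cluster, which can shorten the time a rerun spends waiting for compute to start.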

    For option 2), one way to process the data in smaller chunks:

    Assume you have one large data file, read by a Data Factory source.

    You could turn Sampling ON in that source and set the row limit from a job parameter passed in via Add dynamic content.

    You'd have to wrap your data flow in some sort of iteration activity (e.g., Until) in a calling pipeline. I'd probably check the output of your data flow activity and, if the row count is LESS than the value of the parameter you passed in, break out of the iteration loop; a sketch follows the caveat below.

    One caveat: this may NOT handle "No Records Returned" well, so do some edge-case testing. For example, if you process 1,000 rows at a time and there are exactly 1,000 rows in the last batch, the NEXT batch processes zero rows: does the activity then return an output value of zero, or does it return no result at all?
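
    To make the loop concrete, here is a hedged sketch of what the calling pipeline's JSON might look like. The names (ProcessInChunks, JsonToSqlDataFlow, rowLimit, sink1) are assumptions for illustration; the row count is read from the Execute Data Flow activity's sink metrics (rowsWritten), which is one readily available count, so adjust if you prefer a source-side metric. Because pipeline variables only come in String/Boolean/Array types, the count is stored as a string and converted with int() in the Until expression. Property names and casing are approximate.

    ```json
    {
      "name": "ProcessInChunks",
      "properties": {
        "parameters": {
          "rowsPerBatch": { "type": "int", "defaultValue": 1000 }
        },
        "variables": {
          "lastBatchRows": { "type": "String", "defaultValue": "0" }
        },
        "activities": [
          {
            "name": "UntilLastBatchIsShort",
            "type": "Until",
            "description": "Repeat the data flow until a batch writes fewer rows than rowsPerBatch.",
            "typeProperties": {
              "expression": {
                "value": "@less(int(variables('lastBatchRows')), pipeline().parameters.rowsPerBatch)",
                "type": "Expression"
              },
              "activities": [
                {
                  "name": "RunDataFlowBatch",
                  "type": "ExecuteDataFlow",
                  "description": "Runs the JSON-to-SQL data flow with its sampling row limit set to rowsPerBatch.",
                  "typeProperties": {
                    "dataflow": {
                      "referenceName": "JsonToSqlDataFlow",
                      "type": "DataFlowReference",
                      "parameters": {
                        "rowLimit": {
                          "value": "@pipeline().parameters.rowsPerBatch",
                          "type": "Expression"
                        }
                      }
                    }
                  }
                },
                {
                  "name": "CaptureRowCount",
                  "type": "SetVariable",
                  "description": "Stores the rows written by the sink named 'sink1' so the Until expression can test it.",
                  "dependsOn": [
                    { "activity": "RunDataFlowBatch", "dependencyConditions": [ "Succeeded" ] }
                  ],
                  "typeProperties": {
                    "variableName": "lastBatchRows",
                    "value": {
                      "value": "@string(activity('RunDataFlowBatch').output.runStatus.metrics.sink1.rowsWritten)",
                      "type": "Expression"
                    }
                  }
                }
              ]
            }
          }
        ]
      }
    }
    ```

    If the final iteration returns no rows at all, the sink metric may be reported as zero or may be absent entirely (the edge case in the caveat above), so test that path before relying on the expression as written.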

    A similar thread has been discussed here.

    I hope this helps. Please let me know if you have any further questions.