ADF: For Each Not Reaching Max Concurrency or Batch Size

Romano, Anthony 21 Reputation points
2021-04-20T16:54:15.383+00:00

Summary:
I have a for each loop that calls a pipeline with varying parameters. I am running into the problem that the for each loop is not executing the maximum number of pipelines at any given time, and near the end seems to do them one by one.

Details:
For Each:
Batch Size = 20

Pipeline:
Max Concurrency = 20

Expected Behavior:
I would expect the number of concurrent pipeline executions to always reach the maximum allowed (based on Batch Size and Max Concurrency), except when fewer pipelines remain than are allowed.

Current Behavior:
Initially, the number of pipeline executions matches the maximum number allowed. However, after the initial load, it starts trickling off until it does only one at a time.

If anyone has any details on how to fix this behavior, please let me know.

Thanks,

Anthony R
Data Engineer

Azure Data Factory

Accepted answer
  1. David Beavon 991 Reputation points
    2021-05-11T13:17:45.9+00:00

    @Romano, Anthony @KranthiPakala-MSFT

    This is a known issue. ADF doesn't do any dynamic load balancing.

    It is a primitive scheduler that divides the work into buckets at the start of the operation, and those buckets don't get replenished with any additional/remaining work once they are emptied.

    Here is a reference:

    https://learn.microsoft.com/en-us/azure/data-factory/pipeline-trigger-troubleshoot-guide#degree-of-parallelism--increase-does-not-result-in-higher-throughput

    "The queues are pre-created. This means there is no rebalancing of the queues during the runtime."
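To make the effect concrete, here is a minimal simulation sketch. It is not ADF's actual scheduler: the round-robin assignment of items to queues, the function names, and the durations are all assumptions chosen for illustration. It compares pre-created queues that are never rebalanced against a pool where a free worker immediately takes the next item.

```python
import heapq


def static_buckets_makespan(durations, batch_count):
    """Total time when items are pre-assigned to fixed queues (assumed round-robin)
    and each queue runs its items one after another with no rebalancing."""
    queues = [[] for _ in range(batch_count)]
    for i, d in enumerate(durations):
        queues[i % batch_count].append(d)
    return max(sum(q) for q in queues)


def dynamic_pool_makespan(durations, batch_count):
    """Total time when any free worker immediately pulls the next item."""
    free_at = [0] * batch_count  # min-heap of times at which each worker is free
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + d)
    return max(free_at)


# Hypothetical child pipeline durations in minutes: two slow runs, six fast ones.
durations = [10, 1, 1, 1, 10, 1, 1, 1]
static_total = static_buckets_makespan(durations, 4)
dynamic_total = dynamic_pool_makespan(durations, 4)
print(static_total, dynamic_total)  # 20 11
```

In the static case, both slow items land in the same queue, so three queues sit empty while one queue finishes alone. That matches the "trickling off until it does only one at a time" symptom described in the question.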

    ADF is great for certain activities (e.g., copy activities that ingest data into ADLS as Parquet). I think the intention is that ADF activities need to be driven by a "real" compute resource that can do the top-level orchestration. For example, you can trigger your ADF pipelines over REST from a WebJob in Azure App Service or from Azure Functions. As long as your top-level orchestration is outside of ADF, you have a lot of flexibility.
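A rough sketch of what such an external orchestrator could look like follows. The URL is the Azure management endpoint for creating a pipeline run (api-version 2018-06-01). Everything else (`run_all`, `fake_run`, the `table` parameter) is hypothetical and stands in for real authentication, the POST call, and status polling. Because the create-run call returns a run ID right away rather than waiting for the pipeline, a real worker would need to poll until the run finishes so that it holds its slot for the full duration.

```python
from concurrent.futures import ThreadPoolExecutor


def create_run_url(subscription_id, resource_group, factory, pipeline):
    """Build the ADF REST endpoint that starts a pipeline run."""
    return (
        "https://management.azure.com"
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.DataFactory/factories/{factory}"
        f"/pipelines/{pipeline}/createRun?api-version=2018-06-01"
    )


def run_all(work_items, run_one, max_parallel=20):
    """Run every item with at most max_parallel in flight.
    A worker that finishes picks up the next item immediately."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(run_one, work_items))


def fake_run(params):
    # Placeholder (assumption): a real version would POST params to
    # create_run_url(...) with a bearer token, then poll the run status
    # until the pipeline run completes.
    return f"done:{params['table']}"


work = [{"table": "a"}, {"table": "b"}, {"table": "c"}]
results = run_all(work, fake_run, max_parallel=2)
print(results)  # ['done:a', 'done:b', 'done:c']
```

With this shape, the concurrency limit is enforced by the thread pool rather than by the ForEach activity, so slots are refilled as soon as they free up.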

    I had opened a support case, and the documentation you see in the link above was the resolution of that case. Ideally ADF would use dynamic load balancing, since that is how customers would expect it to behave.

    Another solution is to organize your list of work manually with longer tasks sequenced on one side of the list and shorter tasks on the other. When this type of list is parallelized, it might end up being fairly well balanced. Unfortunately this is a hack and you can't always predict the amount of time that activities will take.
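As a small illustration of the ordering workaround, the sketch below sorts a list by estimated duration before it is split into queues. It assumes items are dealt to queues round-robin, which is not confirmed for ADF, and the estimates are made up. Under a different assignment scheme (for example contiguous chunks), the same ordering could make things worse.

```python
def bucket_totals(durations, batch_count):
    """Sum of durations per queue, assuming round-robin assignment."""
    totals = [0] * batch_count
    for i, d in enumerate(durations):
        totals[i % batch_count] += d
    return totals


# Hypothetical duration estimates in minutes.
estimated = [10, 1, 1, 1, 10, 1, 1, 1]

unsorted_totals = bucket_totals(estimated, 4)
sorted_totals = bucket_totals(sorted(estimated, reverse=True), 4)

print(unsorted_totals)  # [20, 2, 2, 2]
print(sorted_totals)    # [11, 11, 2, 2]
```

The longest queue drops from 20 to 11 minutes here, but only because the estimates were accurate and the two slow items ended up in different queues.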


1 additional answer

  1. KranthiPakala-MSFT 46,642 Reputation points Microsoft Employee Moderator
    2021-04-21T21:14:32.16+00:00

    Hi @Romano, Anthony ,

    Welcome to the Microsoft Q&A forum, and sorry for the delay in response.

    Based on the description you have provided, I believe you might be running your parent pipeline (which contains the ForEach activity) in debug mode rather than as a triggered run.

    When you debug run your parent pipeline, all the activities inside the ForEach loop are executed sequentially even if you set "isSequential": false, and each Execute Pipeline activity waits on completion for debugging purposes. For triggered runs, the ForEach loop uses the defined 'Batch count' for parallel executions.

    I have tested this scenario to verify whether it is a bug or expected behavior, and I can confirm that this is expected behavior for a debug run. When you run your pipeline through a trigger, the ForEach loop will use your Batch count for parallel executions.

    Please see below GIF:

    90111-foreachloopbatchcount.gif

    Additional info: I have attached the JSON payloads of my test pipelines in case you would like to go through the configuration or test it yourself.
    90055-repro-foreachbatchcountconcurrentruns.txt
    90066-pipelinechild.txt

    Hope this clarifies. Do let us know if you have any further questions.

    ----------

    Please don’t forget to Accept Answer and Up-Vote wherever the information provided helps you, this can be beneficial to other community members.

