Apparently Failed Parallel Execution of Multiple Pipelines from ForEach Activity in Azure Data Factory

Sushant Upadhyay 1 Reputation point
2021-06-17T04:17:10.053+00:00

I have an ADF (Controller) pipeline that I use to train ML models on Azure Databricks. I need to train models on multiple datasets, so I have created another ADF (Worker) pipeline within which I use an Azure Databricks activity to train a model using MLlib. The Worker pipeline and the Databricks linked service are both parameterized. Since the training tasks can run for up to 30 hours, training on all datasets must run in parallel, with each dataset being used on a separate Databricks cluster. Hence, I trigger the Worker pipeline (using an Execute Pipeline activity) for each dataset separately from within a ForEach activity, with the 'Sequential' checkbox unchecked, and pass the dataset (ADLS) locations as Worker pipeline parameters for the notebooks to load. A rough sketch of this setup is shown below.
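For reference, this is roughly what the ForEach in my Controller pipeline looks like in the JSON view (activity and parameter names such as ForEachDataset, TrainOnDataset, WorkerPipeline, datasetLocations and datasetLocation are placeholders here, not my exact definitions):

{
    "name": "ForEachDataset",
    "type": "ForEach",
    "typeProperties": {
        "isSequential": false,
        "batchCount": 45,
        "items": {
            "value": "@pipeline().parameters.datasetLocations",
            "type": "Expression"
        },
        "activities": [
            {
                "name": "TrainOnDataset",
                "type": "ExecutePipeline",
                "typeProperties": {
                    "pipeline": {
                        "referenceName": "WorkerPipeline",
                        "type": "PipelineReference"
                    },
                    "waitOnCompletion": true,
                    "parameters": {
                        "datasetLocation": {
                            "value": "@item()",
                            "type": "Expression"
                        }
                    }
                }
            }
        ]
    }
}

My understanding is that with 'Sequential' unchecked, the ForEach runs at most 'Batch count' iterations concurrently (default 20, maximum 50), so for 45 datasets I have set the batch count explicitly.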

To test the setup, I used very small datasets so that each training cycle takes only about 1 minute instead of 30 hours.
However, when I look at the Azure Databricks Jobs tab to check the allocation of clusters, I find that they are being provisioned sequentially, i.e. a job starts only after the previous job has ended and the previous cluster has terminated.

I am a bit at a loss regarding this.
Please help me with this, as we need to train models on around 45 different datasets, so we would need to start 45 different clusters in order for all the training to finish in about 30 hours. As is evident, running 45 thirty-hour jobs sequentially is clearly not an option.

Thanks,
Sushant

Azure Data Factory
