Limit Databricks Job Clusters when Started By ADF

Justin Fainges 25 Reputation points
2023-08-29T05:00:40.3633333+00:00

I'm using ADF to run ELT pipelines. As part of the process, each pipeline (29 so far) uses a copy activity to pull data from various sources, then a Databricks script to do the transformations.

The process works fine, but each pipeline attempts to spin up a new job cluster, so we quickly hit our cap of 50 cores and the remaining pipelines fail.

This question has been asked a few times, but the only answer seems to be 'increase your core limit'. Please note I DO NOT wish to increase my limit; I'm happy for these jobs to queue until cores are available.

My question is: is it possible to set a maximum on the number of cores used, and have ADF/Databricks queue any remaining jobs?

I know one option is to query the Databricks API for running jobs and add wait activities until cores are free, but this logic would need to be added to every pipeline, which is not ideal.

Azure Databricks
Azure Data Factory

Accepted answer
  1. Amira Bedhiafi 15,046 Reputation points
    2023-08-29T13:54:20.7533333+00:00

    In the Databricks workspace, when you define a job, you can specify the configuration of the new job cluster it will run on. This includes the number of worker nodes, which, together with the node type and the driver node, determines the number of cores the job consumes.
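As a rough sketch (node type, Spark version, and sizes here are illustrative, not prescriptive), a job cluster spec and the core arithmetic behind it might look like this:

```python
# Illustrative job-cluster spec: on a Standard_DS3_v2 (4 vCPUs per node),
# total cores = (num_workers + 1 driver node) * cores per node.
new_cluster = {
    "spark_version": "13.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",  # 4 cores per node
    "num_workers": 2,                   # 2 workers + 1 driver = 12 cores
}

def total_cores(num_workers: int, cores_per_node: int) -> int:
    """Cores a job cluster consumes, counting the driver node."""
    return (num_workers + 1) * cores_per_node

print(total_cores(new_cluster["num_workers"], 4))  # 12
```

Sizing each job cluster this way only controls cores per job; with 29 pipelines overlapping, even small clusters can collectively exceed a 50-core cap, which is why the options below matter.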

    When setting up the Databricks linked service in ADF, you can set the Existing Cluster ID. Every ADF pipeline that uses this linked service will then run on that existing cluster rather than creating a new job cluster. Overlapping jobs will queue and run serially on that one cluster, which avoids maxing out your cores.
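Sketched as the linked service definition (shown here as a Python dict; the workspace URL and cluster ID are placeholders, not real values):

```python
# Illustrative ADF linked service targeting one shared, existing cluster
# via existingClusterId, instead of spinning up a new job cluster per run.
linked_service = {
    "name": "AzureDatabricksSharedCluster",  # hypothetical name
    "properties": {
        "type": "AzureDatabricks",
        "typeProperties": {
            "domain": "https://adb-0000000000000000.0.azuredatabricks.net",
            "existingClusterId": "0000-000000-placeholder",  # shared cluster
            "authentication": "MSI",
        },
    },
}
print("existingClusterId" in linked_service["properties"]["typeProperties"])
```

The trade-off: a single shared cluster serializes work, so total runtime grows, but core usage stays fixed at that one cluster's size.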

    Databricks has a feature called Pools which can be used to pre-warm instances. Clusters that draw from a pool start faster, and the pool's maximum capacity caps the number of instances (and therefore cores) that clusters attached to it can consume. This way you can control concurrency at the pool level.
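A minimal sketch of a pool definition (pool name and sizes are illustrative) where `max_capacity` keeps the attached clusters under the 50-core subscription cap:

```python
# Illustrative instance-pool spec: max_capacity bounds how many VMs all
# clusters drawing from this pool can use at once, which bounds cores.
CORES_PER_NODE = 4  # e.g. Standard_DS3_v2

instance_pool = {
    "instance_pool_name": "adf-elt-pool",  # hypothetical name
    "node_type_id": "Standard_DS3_v2",
    "min_idle_instances": 2,               # pre-warmed VMs for fast starts
    "max_capacity": 12,                    # 12 nodes * 4 cores = 48 cores
}

print(instance_pool["max_capacity"] * CORES_PER_NODE)  # 48, under the 50-core cap
```

Jobs that request nodes beyond the pool's capacity wait for instances to free up rather than tripping the subscription-level core quota.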

    Use ADF's pipeline concurrency settings. You can limit how many instances of a particular pipeline run at the same time.
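In the pipeline definition this is the `concurrency` property (sketched below as a dict; the pipeline name is illustrative). Runs beyond the limit are queued rather than failed:

```python
# Illustrative ADF pipeline definition: "concurrency" caps simultaneous
# runs of this pipeline; extra triggered runs queue until a slot frees.
pipeline = {
    "name": "CopyAndTransform",  # hypothetical pipeline name
    "properties": {
        "concurrency": 1,        # at most one active run at a time
        "activities": [],        # copy activity + Databricks activity here
    },
}
print(pipeline["properties"]["concurrency"])  # 1
```

Note this limits runs of *one* pipeline; with 29 distinct pipelines you would still need one of the cluster-level controls above to bound total cores.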

    Set up dependency conditions in ADF such that certain activities don't run unless others have completed. This is more manual and requires foresight into which jobs might be running concurrently.

    Rather than querying the Databricks API directly in every ADF pipeline, you can create an Azure Logic App that manages the execution queue for your Databricks jobs. The Logic App can check if there's available capacity before triggering a new Databricks job via ADF. This abstracts away the check from individual ADF pipelines and centralizes the logic.
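The gating decision such a Logic App (or an Azure Function) would make can be sketched as a pure function. In practice the active runs would come from the Databricks Jobs API (e.g. listing active runs) and the core counts from each run's cluster spec; here they are plain inputs so the logic itself is visible:

```python
# Sketch of the capacity check a central queue manager could perform
# before triggering an ADF pipeline. Inputs are assumed to be gathered
# from the Databricks Jobs API; values below are illustrative.
CORE_CAP = 50

def can_trigger(active_run_cores: list[int], new_job_cores: int,
                cap: int = CORE_CAP) -> bool:
    """True if starting a job needing new_job_cores stays within the cap."""
    return sum(active_run_cores) + new_job_cores <= cap

# Two active 12-core runs: a third 12-core job still fits (36 <= 50)...
print(can_trigger([12, 12], 12))      # True
# ...but a 24-core job on top of three active runs does not (60 > 50).
print(can_trigger([12, 12, 12], 24))  # False
```

If `can_trigger` returns `False`, the manager simply waits and re-checks, so individual pipelines never need their own polling logic.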

    If many of your transformations share common logic, consider refactoring your jobs and using Databricks libraries to modularize and centralize some of the logic. This can reduce the number of distinct jobs you need to run, possibly reducing the need for so many concurrent clusters.

    You can use a combination of the existing-cluster and pool approaches to manage the concurrency and core usage of the Databricks jobs initiated from ADF.

