Reusing a single Databricks Cluster Pool on an entire Azure Data Factory pipeline

Guilherme Gaspar Monteiro 25 Reputation points
2024-01-25T19:42:49.32+00:00

I'm currently trying to execute an ADF pipeline with a Databricks cluster pool. However, I'm facing a problem:

Whenever the pipeline finishes a task (e.g. a Databricks notebook activity), the cluster created from the pool terminates and a new one is started for the next task. Unfortunately, this means there is a start-up time (around 5 min) for each task in the pipeline, making it no more time-efficient than simply using a job cluster.

I was wondering if there could be a way to allocate a cluster pool to an entire ADF pipeline and have it run from beginning to end on that same cluster pool, without terminating and restarting on each task.

I know I can leverage Databricks Workflows and call them via the API from ADF, but my entire workflow is already built in ADF. Thank you for your time.


Accepted answer
  1. Sina Salam 17,176 Reputation points
    2024-01-25T20:47:18.18+00:00

    Hello @Guilherme Gaspar Monteiro ,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    Regarding reusing a single Databricks cluster pool across an entire Azure Data Factory pipeline:

    Due to your specific requirements and constraints, it seems that using a Databricks Workflow triggered by Azure Data Factory (ADF) might be the most suitable solution. Although your entire workflow is built on ADF, integrating Databricks Workflows into your pipeline through API calls provides a way to achieve the desired behavior without restarting the cluster pool for each task.

    My suggestion for best practice: organize your workflow and set up your Databricks cluster pool within the Databricks workspace, and build the sequence of notebooks as a single Databricks Workflow (job) that runs on clusters drawn from that pool. Then use the Databricks REST API to trigger that workflow from your Azure Data Factory pipeline, passing any parameters from ADF in the API call (after configuring the workflow to accept parameters). Because all tasks run inside one job run, the cluster is provisioned once and reused for the whole workflow; ADF serves only as the trigger and monitoring mechanism, so you avoid the start-up time for each task. A sketch of this pattern follows below.
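
    For illustration, here is a minimal Python sketch of that trigger-and-monitor pattern, assuming the Databricks Jobs API 2.1. The workspace URL, token, job ID, and parameter values are hypothetical placeholders to replace with your own; in ADF the same two calls would typically map to a Web activity that starts the run and an Until loop that polls its status.

    ```python
    import time
    import requests

    # Hypothetical values - replace with your workspace URL, a PAT or
    # Azure AD token, and the job ID of the Databricks Workflow that
    # mirrors your ADF task sequence.
    DATABRICKS_HOST = "https://adb-1234567890123456.7.azuredatabricks.net"
    TOKEN = "<personal-access-token-or-aad-token>"
    JOB_ID = 123456789

    headers = {"Authorization": f"Bearer {TOKEN}"}

    # Trigger the workflow once; all of its tasks share the clusters defined
    # in the job, so the pool start-up cost is paid a single time.
    run = requests.post(
        f"{DATABRICKS_HOST}/api/2.1/jobs/run-now",
        headers=headers,
        json={
            "job_id": JOB_ID,
            # Parameters coming from ADF can be forwarded here.
            "notebook_params": {"run_date": "2024-01-25"},
        },
    ).json()
    run_id = run["run_id"]

    # Poll the run until it reaches a terminal state, which is what an ADF
    # Until loop with a Web activity would do.
    while True:
        status = requests.get(
            f"{DATABRICKS_HOST}/api/2.1/jobs/runs/get",
            headers=headers,
            params={"run_id": run_id},
        ).json()
        state = status["state"]
        if state.get("life_cycle_state") in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
            print("Result:", state.get("result_state"), state.get("state_message"))
            break
        time.sleep(30)
    ```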

    I hope this is helpful! Do not hesitate to let me know if you have any other questions.

    Please remember to "Accept Answer" if the answer helped, so that others in the community facing similar issues can easily find the solution.

    Best Regards,

    Sina Salam

