How to configure ADF pipeline run, linked service, so it uses Databricks serverless compute

Question

Krzysztof Przysowa 20

Databricks has recently announced serverless compute for workflows:

https://learn.microsoft.com/en-us/azure/databricks/workflows/jobs/run-serverless-jobs

I would like to be able to execute Azure Data Factory (ADF) jobs using this functionality.

Currently, for job compute I have to specify driver and worker type, with serverless it is not needed.

phemanth 15,765 Reputation points Microsoft External Staff Moderator

2024-05-02T10:41:41.7166667+00:00

@Krzysztof Przysowa Just checking in to see whether the answer below helped. If it answers your query, please click Accept Answer and Yes for "Was this answer helpful", which might be beneficial to other community members reading this thread. If you have any further queries, do let us know.
phemanth 15,765 Reputation points Microsoft External Staff Moderator

2024-09-18T16:18:43.5866667+00:00

@Krzysztof Przysowa Just checking in to see whether the answer below helped. If it answers your query, please click Accept Answer and Yes for "Was this answer helpful". If you have any further queries, do let us know.

4 answers


Answer 1

phemanth 15,765 Microsoft External Staff Moderator

@Krzysztof Przysowa

Thanks for using MS Q&A platform and posting your query.

Serverless compute for workflows allows you to run your Databricks job without configuring and deploying infrastructure. With serverless compute, you focus on implementing your data processing and analysis pipelines, and Databricks manages compute resources, including optimizing and scaling compute for your workloads.

To connect your Azure Data Factory (ADF) pipeline to Databricks, set up the linked service and the pipeline as follows:

1. Create a Linked Service for Databricks: On the ADF home page, switch to the Manage tab in the left panel. Select Linked services under Connections, and then select + New. In the New linked service window, select Compute > Azure Databricks, and then select Continue.
2. Configure the Linked Service: In the New linked service window, complete the following steps:
- For Name, enter AzureDatabricks_LinkedService.
- Provide the necessary details for your Databricks workspace, such as the URL and access token.
3. Configure the ADF Pipeline: When creating or editing a pipeline in ADF, you can specify the Databricks linked service as the compute environment for your activities.
4. Parametrize the Spark Configs: If you want to parametrize the Spark config values as well as keys, you can do so when writing an ARM template for Data Factory. In the "Microsoft.DataFactory/factories/linkedservices" resource, you can define the newClusterSparkConf.

Use Serverless Compute with Databricks Jobs: Per the official documentation, here are the steps to configure an existing job to use serverless compute:

1. Open the job you want to edit.
2. In the Job details side panel, click Swap under Compute.
3. Click New, enter or update any settings, and click Update.

Alternatively, you can click in the Compute drop-down menu and select Serverless.

Please go through this link for more details: https://docs.databricks.com/en/workflows/jobs/run-serverless-jobs.html

Please note that your Databricks workspace must have Unity Catalog enabled and your workloads must support shared access mode. Also, your Azure Databricks workspace must be in a supported region.

You can also automate creating and running jobs that use serverless compute with the Jobs API, Databricks Asset Bundles, and the Databricks SDK for Python.
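As a minimal sketch of the Jobs API route mentioned above, the snippet below builds a create-job request body for a notebook task. It assumes that omitting every cluster field (new_cluster, existing_cluster_id, job_cluster_key) from the task is what selects serverless compute in a workspace where serverless is enabled; please verify this against the run-serverless-jobs documentation linked below. The job name, notebook path, and the helper name build_serverless_job_payload are hypothetical.

```python
import json


def build_serverless_job_payload(job_name: str, notebook_path: str) -> dict:
    """Build a Jobs API create-job request body for a single notebook task.

    Assumption: the task deliberately carries no cluster fields
    (new_cluster, existing_cluster_id, job_cluster_key), which is
    expected to make the job run on serverless compute.
    """
    return {
        "name": job_name,
        "tasks": [
            {
                "task_key": "main",
                "notebook_task": {"notebook_path": notebook_path},
            }
        ],
    }


if __name__ == "__main__":
    # Hypothetical job name and notebook path, for illustration only.
    payload = build_serverless_job_payload("nightly_load", "/Workspace/etl/nightly_load")
    print(json.dumps(payload, indent=2))
```

The resulting body could then be sent to the job creation endpoint of the Jobs API, or expressed equivalently in a Databricks Asset Bundle.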

Please refer to:

https://learn.microsoft.com/en-us/azure/databricks/workflows/jobs/run-serverless-jobs

https://learn.microsoft.com/en-us/azure/databricks/workflows/jobs/use-compute

Hope this helps. Do let us know if you have any further queries.

If this answers your query, please click Accept Answer and Yes for "Was this answer helpful".

Krzysztof Przysowa 20 Reputation points

2024-05-01T13:25:50.15+00:00

@phemanth, many thanks for your answer. Unfortunately, it does NOT answer my query.
My question was how to configure this in Azure Data Factory (ADF); your answer points to Databricks workflows in the Databricks workspace portal.

Please note that I am using Azure Data Factory (ADF) for orchestration and Databricks jobs as compute.
Please advise what to change in the Azure Data Factory Databricks Linked Service to run serverless compute. Currently I can only select one of the fixed-size clusters.

Please note that Databricks serverless compute for workflows is a brand-new Public Preview functionality.
phemanth 15,765 Reputation points Microsoft External Staff Moderator

2024-05-01T13:57:36.73+00:00

@Krzysztof Przysowa

To use serverless compute in Azure Data Factory (ADF) with Databricks, you would typically configure this in the Databricks workspace and then use the configured workspace in ADF. However, there is currently no direct way to specify serverless compute in the ADF Databricks Linked Service configuration.

As per the latest information, compute environments in ADF are configured in the linked service settings, and those settings do not provide an option to specify serverless compute for Databricks.

One potential workaround could be to create a cluster with serverless compute in Databricks and then use the cluster ID in your ADF Databricks Linked Service or activity. However, this might not fully utilize the benefits of serverless compute, as the cluster would need to be managed manually.

I recommend checking the latest Azure Data Factory documentation.
Krzysztof Przysowa 20 Reputation points

2024-05-01T14:12:07.3466667+00:00

@phemanth,
Many thanks for the super prompt feedback. I came to a similar conclusion.
FYI, I have tried what you suggested, i.e. specifying the interactive 'serverless' cluster ID, but it is not a sustainable solution.

Is there any way you can request internally that this functionality / setup be added?
phemanth 15,765 Reputation points Microsoft External Staff Moderator

2024-05-01T14:19:14.2166667+00:00

@Krzysztof Przysowa Sure. We would appreciate it if you could share the feedback on our feedback channel, which is open for the user community to upvote and comment on. This allows our product teams to effectively prioritize your request against our existing feature backlog and gives insight into the potential impact of implementing the suggested feature.

https://feedback.azure.com/d365community/forum/1219ec2d-6c26-ec11-b6e6-000d3a4f032c

Hope this helps. Do let us know if you have any further queries.

Please don't forget to click Accept Answer and Yes for "Was this answer helpful", which might be beneficial to other community members reading this thread.
phemanth 15,765 Reputation points Microsoft External Staff Moderator

2024-09-11T15:37:17.24+00:00

@phemanth As of now, there is no official update on the native implementation of serverless job cluster functionality directly within Azure Data Factory (ADF). The current workaround involves using the Databricks REST API with ADF’s web activity to leverage serverless compute.

However, Databricks serverless compute for workflows is generally available, allowing users to run their Databricks jobs without configuring and deploying infrastructure. This feature optimizes and scales compute resources automatically based on workload requirements.

The community’s feedback has been acknowledged, and there are ongoing discussions about integrating this functionality natively into ADF. Unfortunately, there is no estimated time of arrival (ETA) for this feature yet.

If you have any further questions or need assistance with the current workarounds, feel free to ask!
phemanth 15,765 Reputation points Microsoft External Staff Moderator

2024-09-16T16:55:10.85+00:00

@Krzysztof Przysowa Following up to see if the above answer was helpful. If this answers your query, do click Accept Answer and Yes for was this answer helpful. And, if you have any further query do let us know.

Answer 2

PRADEEPCHEEKATLA 90,646 Moderator

@Krzysztof Przysowa - Thanks for the question and using MS Q&A platform.

Here is an update from the internal team:

The only way to make it work would be to use the Databricks REST API with ADF's Web activity.

For more details, refer to Azure Databricks REST API - Jobs API 2.0 and Web activity in Azure Data Factory and Azure Synapse Analytics.

Here is a third-party article that explains how to call the Databricks REST API from an ADF Web activity: Azure Data Factory integration with Databricks Workflows.
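To illustrate what the Web activity workaround has to do, here is a rough Python sketch of the same two REST calls: trigger the job with run-now, then poll runs/get until the run reaches a terminal state. In ADF this would be a Web activity followed by an Until loop around a second Web activity. The sketch uses the 2.1 paths of the Jobs API (the thread cites 2.0, which is an assumption to adjust as needed); the helper names call_databricks and run_job_and_wait, the polling interval, and the host and token values are all hypothetical.

```python
import json
import time
import urllib.request
from typing import Callable, Optional

# Run life-cycle states after which no further polling is needed.
TERMINAL_STATES = {"TERMINATED", "SKIPPED", "INTERNAL_ERROR"}


def call_databricks(host: str, token: str, method: str, path: str,
                    body: Optional[dict] = None) -> dict:
    """Send one authenticated request to the workspace and decode the JSON reply."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(
        url=f"{host}{path}",
        data=data,
        method=method,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def run_job_and_wait(call: Callable[[str, str, Optional[dict]], dict],
                     job_id: int,
                     poll_seconds: float = 30.0,
                     sleep: Callable[[float], None] = time.sleep) -> str:
    """Trigger an existing job, poll until it finishes, and return its result state."""
    run_id = call("POST", "/api/2.1/jobs/run-now", {"job_id": job_id})["run_id"]
    while True:
        state = call("GET", f"/api/2.1/jobs/runs/get?run_id={run_id}", None)["state"]
        if state["life_cycle_state"] in TERMINAL_STATES:
            # result_state is only present once the run has terminated.
            return state.get("result_state", state["life_cycle_state"])
        sleep(poll_seconds)
```

A caller could bind the connection details once, for example with functools.partial(call_databricks, "https://<workspace-url>", "<token>"), and pass the result as the call argument. The job itself decides which compute it uses, so a job configured for serverless compute runs on serverless regardless of how it is triggered.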

Hope this helps. Do let us know if you have any further queries.

If this answers your query, do click Accept Answer and Yes for was this answer helpful. And, if you have any further query do let us know.

PRADEEPCHEEKATLA 90,646 Reputation points Moderator

2024-05-15T07:56:16.4166667+00:00

@Krzysztof Przysowa - Following up to see if the above answer was helpful. If this answers your query, do click Accept Answer and Yes for was this answer helpful. And, if you have any further query do let us know.
Krzysztof Przysowa 20 Reputation points

2024-05-20T13:46:29.1633333+00:00
Hi @PRADEEPCHEEKATLA ,
Many thanks for sharing these answers. They are useful workarounds, but of course the best solution to this problem would be to:

- implement serverless clusters as native ADF linked service functionality, or at least relax the limitations/validations there so that we can submit jobs without specifying a cluster, as this would allow the use of serverless functionality
- add native support for running Databricks workflows in synchronous mode, so that a single activity waits for the workflow to finish

I have added or upvoted the ADF feedback for the above functionality. Are you aware of any plans to introduce it?
PRADEEPCHEEKATLA 90,646 Reputation points Moderator

2024-05-21T04:35:19.3033333+00:00

@Krzysztof Przysowa - Yes, this is a known limitation with Azure Data Factory. We have created a feature request to bring this to ADF, and we don't have any ETA for when it will be available.

I will update this thread once it's available.

Please don’t forget to Accept Answer and Yes for "was this answer helpful" wherever the information provided helps you, this can be beneficial to other community members.
Krzysztof Przysowa 20 Reputation points

2024-09-05T13:14:39.78+00:00

Hi @PRADEEPCHEEKATLA-MSFT ,
Do you have any update on the serverless job cluster functionality in Azure Data Factory?
The community is trying to implement different workarounds, but what we need is an actual native implementation (it should not be a complex thing, as it is a synchronous API call).

Answer 3

Whether a Databricks Job uses Serverless resources is independent of ADF. The operational steps are as follows:

1. Create an interactive cluster in Databricks and stop it.
2. In ADF, create a Linked Service object for Databricks using the interactive cluster method (this is only needed to enable API authentication for ADF to trigger Databricks Jobs later).
3. Create a Job in Databricks as usual, and configure the Task within the Job to use Serverless resources under Job Compute.
4. In ADF, add a Databricks Job activity. In the configuration, select the Linked Service created in Step 2 and choose the Databricks Job name or Job ID configured in Step 3.

When you trigger the pipeline configured above, you will see that the Databricks Job is scheduled successfully, while the previously referenced interactive cluster is not started. The process works as follows: ADF uses the Linked Service to authenticate and trigger the Databricks Job in the background, and then the Job itself starts the Serverless compute that carries out the task execution.

(This approach does not affect your previous development experience. All you need to do is provide an existing interactive cluster when registering the Linked Service. This is only to ensure that the Linked Service can be created successfully. It will not incur any additional cost or have any further impact.)
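For reference, the Linked Service described in the steps above might look roughly like the definition built below. This is a sketch under assumptions: the property names (domain, existingClusterId, accessToken as a SecureString) follow the Azure Databricks linked service JSON as I recall it, so please check them against the current ADF documentation. The linked service name, workspace URL, cluster ID, and the helper name build_databricks_linked_service are hypothetical placeholders.

```python
import json


def build_databricks_linked_service(name: str, workspace_url: str,
                                    cluster_id: str, access_token: str) -> dict:
    """Build an ADF linked service definition that points at an existing
    interactive cluster. In the approach above, the cluster is referenced
    only so that the linked service can be created; it is not started."""
    return {
        "name": name,
        "properties": {
            "type": "AzureDatabricks",
            "typeProperties": {
                "domain": workspace_url,
                "existingClusterId": cluster_id,
                "accessToken": {"type": "SecureString", "value": access_token},
            },
        },
    }


if __name__ == "__main__":
    # All values below are placeholders, not real identifiers.
    definition = build_databricks_linked_service(
        name="AzureDatabricks_LinkedService",
        workspace_url="https://<workspace-url>",
        cluster_id="<stopped-interactive-cluster-id>",
        access_token="<access-token>",
    )
    print(json.dumps(definition, indent=2))
```

In practice the token would normally come from Azure Key Vault rather than being placed inline.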

Answer 4

Hi @Yunpeng Tang ,
Many thanks. If I understood correctly, the solution to my problem is to use the new Job activity, which is currently in preview.
The Job activity runs the Databricks job / workflow synchronously, so there is no more submit / check-for-results loop logic.

So the provided method is a workaround for using Databricks orchestration from ADF, whereas my question was really about being able to use Databricks serverless compute directly, without an extra layer of orchestration.

Are you aware of any plans to introduce true support for serverless clusters, and libraries for them, in ADF?
Maybe also SQL script execution on a SQL warehouse?
The real breakthrough when it comes to integrating ADF with Databricks would be the ability to reuse job clusters in subsequent activities, without having to wait for a new one to be created.

Let me give it a try and get back to you.
