How to install a Maven library when running a Databricks notebook on a new job cluster through Azure Data Factory?

Prakamya Aishwarya 121 Reputation points
2023-05-27T06:09:06.5333333+00:00

We are migrating our jobs from an interactive cluster to job clusters. I wanted to check whether we can specify which libraries to install at the linked service level.

Please note that I am already aware of the option to install a library from the notebook activity, but I do not want to use it because it would require updating a lot of activities.

Azure Databricks
An Apache Spark-based analytics platform optimized for Azure.
Azure Data Factory
An Azure service for ingesting, preparing, and transforming data at scale.

2 answers

  1. VasimTamboli 4,785 Reputation points
    2023-05-27T06:26:39.78+00:00

    To install a Maven library while running a Databricks notebook using a new job cluster through Azure Data Factory, you can follow these steps:

    1. Open your Azure Data Factory instance and go to the "Author & Monitor" section.
    2. Navigate to the pipeline where you want to configure the library installation.
    3. In the pipeline, locate the Databricks notebook activity that you want to run.
    4. Within the Databricks notebook activity, expand the "Advanced" settings.
    5. In the "Advanced" settings, locate the "Libraries" section.
    6. Within the "Libraries" section, click on the "Linked service" drop-down menu and select the linked service associated with your Databricks workspace.
    7. Once the linked service is selected, you will see the option to specify the libraries to be installed. Click on the "Add library" button.
    8. In the library configuration dialog, choose the "Maven" option.
    9. Specify the Maven coordinates for the library you want to install, including the group ID, artifact ID, version, and any other relevant details.
    10. Click on the "OK" button to save the library configuration.
    11. Save and publish the changes to the pipeline.

    By specifying the library installation at the linked service level, the library will be automatically installed whenever the Databricks notebook activity runs using the job cluster. This way, you won't need to update individual activities, making it easier to manage library dependencies across multiple pipelines and notebooks.
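
    If the library does end up being configured on each notebook activity, the edit can be scripted rather than made by hand. In a pipeline's JSON definition, libraries chosen for a Databricks notebook activity are typically stored under the activity's `typeProperties.libraries`. The following is a minimal sketch, assuming the pipeline definitions are available as JSON (for example from a Git-integrated factory). The function name `add_maven_library`, the example pipeline, and the `org.jsoup:jsoup:1.7.2` coordinates are illustrative assumptions, and the sketch only touches top-level activities, not ones nested inside ForEach or If Condition containers.

```python
import copy
import json


def add_maven_library(pipeline: dict, coordinates: str) -> dict:
    """Return a copy of `pipeline` with a Maven library appended to every
    top-level DatabricksNotebook activity that does not already list it."""
    updated = copy.deepcopy(pipeline)
    for activity in updated.get("properties", {}).get("activities", []):
        if activity.get("type") != "DatabricksNotebook":
            continue
        libraries = activity.setdefault("typeProperties", {}).setdefault("libraries", [])
        already_present = any(
            lib.get("maven", {}).get("coordinates") == coordinates for lib in libraries
        )
        if not already_present:
            libraries.append({"maven": {"coordinates": coordinates}})
    return updated


# Hypothetical pipeline definition, trimmed to the fields used here.
example_pipeline = {
    "name": "ExamplePipeline",
    "properties": {
        "activities": [
            {
                "name": "RunNotebook",
                "type": "DatabricksNotebook",
                "typeProperties": {"notebookPath": "/Shared/example"},
            },
            {"name": "WaitABit", "type": "Wait", "typeProperties": {"waitTimeInSeconds": 1}},
        ]
    },
}

patched_pipeline = add_maven_library(example_pipeline, "org.jsoup:jsoup:1.7.2")
print(json.dumps(patched_pipeline, indent=2))
```

    The function returns a modified copy and skips activities that already list the same coordinates, so it can be re-run safely over a folder of pipeline files before publishing.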


  2. PRADEEPCHEEKATLA-MSFT 85,121 Reputation points Microsoft Employee
    2023-06-05T04:58:35.4233333+00:00

    @Prakamya Aishwarya - Thanks for the question and for using the MS Q&A platform.

    You can install a cluster library directly from a public repository such as PyPI or Maven, using a previously installed workspace library, or using an init script.

    You can pass the path of the init script in the ADF linked service, as shown below.

    User's image
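
    For readers who work with the linked service as JSON rather than through the UI, the setting could look roughly like the sketch below. It assumes the `AzureDatabricks` linked service type and its `newClusterInitScripts` property; every value (name, domain, runtime version, node type, script path) is a placeholder, and authentication settings are left out on purpose.

```python
import json

# All values are placeholders; only the property names are meant to follow
# the AzureDatabricks linked service schema. Authentication is omitted.
linked_service = {
    "name": "AzureDatabricksJobCluster",
    "properties": {
        "type": "AzureDatabricks",
        "typeProperties": {
            "domain": "https://<region>.azuredatabricks.net",
            "newClusterVersion": "<runtime-version>",
            "newClusterNodeType": "<node-type>",
            "newClusterNumOfWorker": "1",
            # Init scripts run on every new job cluster created through
            # this linked service, so all activities using it are covered.
            "newClusterInitScripts": ["dbfs:/<path>/install-libraries.sh"],
        },
    },
}

print(json.dumps(linked_service, indent=2))
```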

    When I ran the Databricks Notebook activity, the init script was installed successfully without any issue.

    User's image

    Here are the event logs of the cluster:

    User's image

    For more details, refer to Azure Databricks - Cluster libraries.
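
    The init script itself is not shown in this answer. One possible shape for it, offered as a sketch under assumptions rather than the script used above, is a shell script that downloads each JAR from Maven Central into the cluster's JAR directory. The helper names `maven_jar_url` and `build_init_script`, the `/databricks/jars` target directory, the availability of `curl` on the cluster nodes, and the example coordinates are all assumptions. Note that this downloads only the named artifact and does not resolve transitive dependencies the way a Maven library install would.

```python
def maven_jar_url(coordinates: str, repo: str = "https://repo1.maven.org/maven2") -> str:
    """Map 'group:artifact:version' to the JAR URL in a standard Maven repository layout."""
    group_id, artifact_id, version = coordinates.split(":")
    group_path = group_id.replace(".", "/")
    return f"{repo}/{group_path}/{artifact_id}/{version}/{artifact_id}-{version}.jar"


def build_init_script(coordinates_list, jar_dir: str = "/databricks/jars") -> str:
    """Build the text of a bash init script that downloads each JAR into jar_dir."""
    lines = ["#!/bin/bash", "set -euo pipefail"]
    for coordinates in coordinates_list:
        url = maven_jar_url(coordinates)
        file_name = url.rsplit("/", 1)[-1]
        # -f makes curl fail on HTTP errors, so a bad coordinate stops the script.
        lines.append(f'curl -fsSL -o "{jar_dir}/{file_name}" "{url}"')
    return "\n".join(lines) + "\n"


# Example coordinates are illustrative only.
init_script = build_init_script(["org.jsoup:jsoup:1.7.2"])
print(init_script)
```

    The generated text would then be saved at the path referenced in the linked service, so that each new job cluster runs it at startup.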

    Hope this helps. Do let us know if you have any further queries.


    If this answers your query, do click Accept Answer and Yes for was this answer helpful. And, if you have any further query do let us know.