Install libraries from a package repository

Azure Databricks provides tools to install libraries from PyPI, Maven, and CRAN package repositories.

Important

The default behavior of the library upload UI has changed. Previously, uploaded libraries were always stored in the DBFS root, and all workspace users can modify data and files stored in the DBFS root.

The default location for library uploads is now workspace files. Databricks recommends uploading libraries to workspace files or Unity Catalog volumes, or using library package repositories. If your workload does not support these patterns, you can also use libraries stored in cloud object storage.

Installing libraries vs. creating workspace libraries

The user interfaces for installing a library from a package repository onto a cluster and for adding it to the workspace are almost identical. The instructions in this article describe how to install libraries directly onto clusters. To add a library to the workspace instead, click Create rather than Install as the final step for each repository.

Libraries added to the workspace from package repositories do not upload files to the DBFS root; instead, they serve as pointers to the library and the repository specified during configuration.

See Workspace libraries and Cluster libraries.

PyPI package

  1. In the Library Source button list, select PyPI.

  2. Enter a PyPI package name. To install a specific version of a library, use this format for the library: <library>==<version>. For example, scikit-learn==0.19.1.

    Note

    For jobs, Databricks recommends that you specify a library version to ensure a reproducible environment. If the library version is not fully specified, Databricks uses the latest matching version, which means that different runs of the same job might use different library versions as new versions are published. Pinning the library version prevents breaking changes in new releases from affecting your jobs.

  3. (Optional) In the Index URL field, enter the URL of a PyPI index.

  4. Click Install or Create. See Installing libraries vs. creating workspace libraries.
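
If you prefer to script this flow rather than use the UI, you can submit an equivalent specification to the Libraries API. The following is a minimal sketch, assuming Python with the requests package, a workspace URL and personal access token available in the DATABRICKS_HOST and DATABRICKS_TOKEN environment variables, and a placeholder cluster ID; adjust these for your workspace.

```python
import os

import requests

# Placeholder values; substitute your workspace URL, token, and cluster ID.
host = os.environ["DATABRICKS_HOST"]    # for example, "https://adb-1234567890123456.7.azuredatabricks.net"
token = os.environ["DATABRICKS_TOKEN"]
cluster_id = "1234-567890-abcde123"     # hypothetical cluster ID

payload = {
    "cluster_id": cluster_id,
    "libraries": [
        # Pin the version so repeated runs install the same package.
        {"pypi": {"package": "scikit-learn==0.19.1"}}
        # To use a custom index, add a "repo" key with the PyPI index URL.
    ],
}

response = requests.post(
    f"{host}/api/2.0/libraries/install",
    headers={"Authorization": f"Bearer {token}"},
    json=payload,
)
response.raise_for_status()
```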

Maven or Spark package

Important

To install Maven libraries on compute configured with shared access mode, you must add the coordinates to the allowlist. See Allowlist libraries and init scripts on shared compute.

  1. In the Library Source button list, select Maven.

  2. Specify a Maven coordinate. Do one of the following:

    • In the Coordinate field, enter the Maven coordinate of the library to install. Maven coordinates are in the form groupId:artifactId:version; for example, com.databricks:spark-avro_2.10:1.0.0.
    • If you don’t know the exact coordinate, enter the library name and click Search Packages. A list of matching packages appears. To see details about a package, click its name. You can sort packages by name, organization, and rating, and you can filter the results by entering a query in the search bar; the results refresh automatically.
      1. Select Maven Central or Spark Packages in the drop-down list at the top left.
      2. Optionally select the package version in the Releases column.
      3. Click + Select next to a package. The Coordinate field is filled in with the selected package and version.
  3. (Optional) In the Repository field, enter a Maven repository URL.

    Note

    Internal Maven repositories are not supported.

  4. In the Exclusions field, optionally provide the groupId and the artifactId of the dependencies that you want to exclude (for example, log4j:log4j).

  5. Click Install or Create. See Installing libraries vs. creating workspace libraries.
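
The Maven steps above can also be expressed as a library specification for the same Libraries API request shown in the PyPI sketch. This is an illustrative sketch only; the repository URL below is a hypothetical example.

```python
# Maven library specification for the "libraries" list in the install payload.
maven_library = {
    "maven": {
        "coordinates": "com.databricks:spark-avro_2.10:1.0.0",
        # Optional: a custom Maven repository URL (hypothetical example).
        "repo": "https://my-company-mirror.example.com/maven2",
        # Optional: dependencies to exclude, as groupId:artifactId.
        "exclusions": ["log4j:log4j"],
    }
}
```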

CRAN package

  1. In the Library Source button list, select CRAN.
  2. In the Package field, enter the name of the package.
  3. (Optional) In the Repository field, enter a CRAN repository URL.
  4. Click Install or Create. See Installing libraries vs. creating workspace libraries.
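
Likewise, a CRAN package can be specified for the Libraries API install request shown in the PyPI sketch. The package name and repository URL below are examples only.

```python
# CRAN library specification for the "libraries" list in the install payload.
cran_library = {
    "cran": {
        "package": "forecast",                    # example package name
        "repo": "https://cran.us.r-project.org",  # optional CRAN repository URL
    }
}
```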

Note

CRAN mirrors serve the latest version of a library. As a result, you may end up with different versions of an R package if you attach the library to different clusters at different times. To learn how to manage and fix R package versions on Databricks, see the Knowledge Base.