Notebook-scoped Python libraries
Notebook-scoped libraries let you create, modify, save, reuse, and share custom Python environments that are specific to a notebook. When you install a notebook-scoped library, only the current notebook and any jobs associated with that notebook have access to that library. Other notebooks attached to the same cluster are not affected.
Notebook-scoped libraries do not persist across sessions. You must reinstall notebook-scoped libraries at the beginning of each session, or whenever the notebook is detached from a cluster.
Databricks recommends using the %pip magic command to install notebook-scoped Python libraries.
You can use %pip in notebooks scheduled as jobs. If you need to manage the Python environment in a Scala, SQL, or R notebook, use the %python magic command in conjunction with %pip.
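For example, in a Scala, SQL, or R notebook, a cell like the following (the package name is illustrative) installs a notebook-scoped Python library:

```
%python
%pip install matplotlib
```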
You might experience more traffic to the driver node when working with notebook-scoped library installs. See How large should the driver node be when working with notebook-scoped libraries?.
To install libraries for all notebooks attached to a cluster, use cluster libraries. See Cluster libraries.
Note
On Databricks Runtime 10.4 LTS and below, you can use the (legacy) Azure Databricks library utility. The library utility is supported only on Databricks Runtime, not Databricks Runtime ML. See Library utility (dbutils.library) (legacy).
Manage libraries with %pip commands
The %pip command is equivalent to the pip command and supports the same API. The following sections show examples of how you can use %pip commands to manage your environment. For more information on installing Python packages with pip, see the pip install documentation and related pages.
Important
- Starting with Databricks Runtime 13.0, %pip commands do not automatically restart the Python process. If you install a new package or update an existing package, you may need to use dbutils.library.restartPython() to see the new packages. See Restart the Python process on Azure Databricks.
- On Databricks Runtime 12.2 LTS and below, Databricks recommends placing all %pip commands at the beginning of the notebook. The notebook state is reset after any %pip command that modifies the environment. If you create Python methods or variables in a notebook and then use %pip commands in a later cell, the methods or variables are lost.
- Upgrading, modifying, or uninstalling core Python packages (such as IPython) with %pip may cause some features to stop working as expected. If you experience such problems, reset the environment by detaching and re-attaching the notebook or by restarting the cluster.
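Taken together, these notes suggest a cell layout like the following on Databricks Runtime 13.0 and above (the package name is illustrative):

```
# Cell 1: install all notebook-scoped libraries first.
%pip install requests

# Cell 2: restart the Python process so newly installed packages are importable.
dbutils.library.restartPython()

# Cell 3 and later: regular code; imports now see the new packages.
import requests
```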
Install a library with %pip
%pip install matplotlib
Install a Python wheel package with %pip
%pip install /path/to/my_package.whl
Uninstall a library with %pip
Note
You cannot uninstall a library that is included in Databricks Runtime (see Databricks Runtime release notes versions and compatibility) or a library that has been installed as a cluster library. If you have installed a library version different from the one included in Databricks Runtime or the one installed on the cluster, you can use %pip uninstall to revert the library to the default version in Databricks Runtime or the version installed on the cluster. However, you cannot use a %pip command to uninstall the version of a library included in Databricks Runtime or installed on the cluster.
%pip uninstall -y matplotlib
The -y option is required.
Install a library from a version control system with %pip
%pip install git+https://github.com/databricks/databricks-cli
You can add parameters to the URL to specify, for example, the version or git subdirectory. See pip's VCS support documentation for more information and for examples using other version control systems.
Install a private package with credentials managed by Databricks secrets with %pip
Pip supports installing packages from private sources with basic authentication, including private version control systems and private package repositories, such as Nexus and Artifactory. Secret management is available via the Databricks Secrets API, which allows you to store authentication tokens and passwords. Use the DBUtils API to access secrets from your notebook. Note that you can use $variables in magic commands.
To install a package from a private repository, specify the repository URL with the --index-url option to %pip install, or add it to the pip config file at ~/.pip/pip.conf.
token = dbutils.secrets.get(scope="scope", key="key")
%pip install --index-url https://<user>:$token@<your-package-repository>.com/<path/to/repo> <package>==<version> --extra-index-url https://pypi.org/simple/
Similarly, you can use secret management with magic commands to install private packages from version control systems.
token = dbutils.secrets.get(scope="scope", key="key")
%pip install git+https://<user>:$token@<gitprovider>.com/<path/to/repo>
Install a package from DBFS with %pip
Important
Any workspace user can modify files stored in DBFS. Azure Databricks recommends storing files in workspaces or on Unity Catalog volumes.
You can use %pip to install a private package that has been saved on DBFS.
When you upload a file to DBFS, the file is automatically renamed: spaces, periods, and hyphens are replaced with underscores. Python wheel files are an exception: pip requires that the filename use periods in the version (for example, 0.1.0) and hyphens instead of spaces or underscores, so wheel filenames are not changed.
%pip install /dbfs/mypackage-0.0.1-py3-none-any.whl
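The renaming rule described above can be sketched in plain Python. This is an illustrative approximation of the documented behavior, not the actual upload code, and the exact handling of file extensions may differ:

```python
import re

def dbfs_upload_name(filename: str) -> str:
    """Approximate the rename DBFS applies on upload: spaces, periods,
    and hyphens in the base name become underscores. Python wheel files
    are left unchanged so pip can still parse the name and version."""
    if filename.endswith(".whl"):
        return filename
    base, dot, ext = filename.rpartition(".")
    if not dot:  # no extension: rewrite the whole name
        return re.sub(r"[ .-]", "_", filename)
    return re.sub(r"[ .-]", "_", base) + "." + ext

print(dbfs_upload_name("my data-file.v2.csv"))               # my_data_file_v2.csv
print(dbfs_upload_name("mypackage-0.0.1-py3-none-any.whl"))  # unchanged
```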
Install a package from a volume with %pip
Important
This feature is in Public Preview.
With Databricks Runtime 13.3 LTS and above, you can use %pip to install a private package that has been saved to a volume.
When you upload a file to a volume, the file is automatically renamed: spaces, periods, and hyphens are replaced with underscores. Python wheel files are an exception: pip requires that the filename use periods in the version (for example, 0.1.0) and hyphens instead of spaces or underscores, so wheel filenames are not changed.
%pip install /Volumes/<catalog>/<schema>/<path-to-library>/mypackage-0.0.1-py3-none-any.whl
Install a package stored as a workspace file with %pip
With Databricks Runtime 11.3 LTS and above, you can use %pip to install a private package that has been saved as a workspace file.
%pip install /Workspace/<path-to-whl-file>/mypackage-0.0.1-py3-none-any.whl
Save libraries in a requirements file
%pip freeze > /Workspace/shared/prod_requirements.txt
Any subdirectories in the file path must already exist. If you run %pip freeze > /Workspace/<new-directory>/requirements.txt, the command fails if the directory /Workspace/<new-directory> does not already exist.
Use a requirements file to install libraries
A requirements file contains a list of packages to be installed using pip. An example of using a requirements file is:
%pip install -r /Workspace/shared/prod_requirements.txt
See Requirements File Format for more information on requirements.txt files.
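For reference, a requirements file is just a plain-text list of pip requirement specifiers, one per line. The following sketch uses hypothetical package pins and writes to a temporary file rather than /Workspace:

```python
import os
import tempfile

# Hypothetical pinned requirements, in the format `%pip freeze` emits.
requirements = "numpy==1.26.4\nrequests==2.31.0\n"

# Write the file; in a notebook the path would typically be under /Workspace.
path = os.path.join(tempfile.mkdtemp(), "prod_requirements.txt")
with open(path, "w") as f:
    f.write(requirements)

# `%pip install -r <path>` reads the file line by line; each non-empty
# line is a standard pip requirement specifier.
with open(path) as f:
    specs = [line.strip() for line in f if line.strip()]
print(specs)  # ['numpy==1.26.4', 'requests==2.31.0']
```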
How large should the driver node be when working with notebook-scoped libraries?
Using notebook-scoped libraries might result in more traffic to the driver node as it works to keep the environment consistent across executor nodes.
When you use a cluster with 10 or more nodes, Databricks recommends these specs as a minimum requirement for the driver node:
- For a 100 node CPU cluster, use Standard_DS5_v2.
- For a 10 node GPU cluster, use Standard_NC12.
For larger clusters, use a larger driver node.
Can I use %sh pip, !pip, or pip? What is the difference?
%sh and ! execute a shell command in a notebook; the former is a Databricks auxiliary magic command, while the latter is a feature of IPython. pip is shorthand for %pip when automagic is enabled, which is the default in Azure Databricks Python notebooks.
On Databricks Runtime 11.3 LTS and above, %pip, %sh pip, and !pip all install a library as a notebook-scoped Python library. On Databricks Runtime 10.4 LTS and below, Databricks recommends using only %pip or pip to install notebook-scoped libraries, because the behavior of %sh pip and !pip is not consistent on those versions.
Known issues
- On Databricks Runtime 9.1 LTS, notebook-scoped libraries are incompatible with batch streaming jobs. Databricks recommends using cluster libraries or the IPython kernel instead.