Manage Apache Spark libraries in Microsoft Fabric
A library is a collection of pre-written code that can be imported to provide extra functionality. By using libraries, developers can save time and effort by not having to write code from scratch to perform common tasks. Instead, they can import the library and use its functions and classes to achieve their desired functionality. On Microsoft Fabric, multiple mechanisms are provided to help you manage and use the libraries.
- Built-in libraries: Each Fabric Spark runtime provides a rich set of popular preinstalled libraries. You can find the full built-in library list in Fabric Spark Runtime. To quickly check the version of a preinstalled library, see the sketch after this list.
- Public library: Public libraries are sourced from the currently supported repositories, PyPI and Conda.
- Custom library: Custom libraries are code built by you or your organization, and are supported in the .whl, .jar, and .tar.gz formats. The .tar.gz format is only supported for the R language. For Python custom libraries, use the .whl format.
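To quickly confirm which version of a preinstalled library your runtime provides, you can check it directly from a notebook cell. The following is a minimal sketch; pandas is only an illustrative built-in package:

```python
# Check the version of a preinstalled library in the current session.
# pandas is used here only as an example of a built-in package.
import pandas as pd

print(pd.__version__)
```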
Library management in workspace setting
Important
Library management in the workspace setting is no longer supported. Follow "Migrate the workspace libraries and Spark properties to a default environment" to move your existing libraries to an environment and attach it as the workspace default.
Summary of library management best practices
Scenario 1: admin sets default libraries for the workspace
To set default libraries, you must be the admin of the workspace. Create a new environment, install the required libraries, and then attach this environment as the workspace default in the workspace settings.
The notebooks and Spark job definitions in the workspace that are attached to the Workspace Settings start their sessions with the libraries installed in the workspace's default environment.
Scenario 2: persist library specifications for one or multiple code items
If you want to persist the library specifications, install the libraries in an environment and attach it to the code items.
One benefit of doing so is that it avoids duplicated effort when the code always requires the same common libraries. Once the libraries are successfully installed in the environment, they are effective in all Spark sessions that have the environment attached.
Another benefit is that library configuration can be scoped below the workspace level. One environment can be attached to multiple code artifacts, so if a subset of notebooks or Spark job definitions in one workspace requires the same libraries, attach them to the same environment. Admins, members, and contributors of the workspace can create, edit, and attach environments.
Scenario 3: in-line installation in interactive run
If you want to use a library that isn't installed for a one-time task in an interactive notebook run, in-line installation is the most convenient option. In-line commands in Fabric make the library effective in the current notebook Spark session, but it doesn't persist across different sessions.
Users who have permission to run the notebook can install extra libraries in the Spark session.
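For example, a single cell can make a library available for the current session only. This is a minimal sketch, assuming beautifulsoup4 stands in for whatever one-off dependency you need:

```python
# Session-scoped install: available only in this notebook Spark session, not persisted.
%pip install beautifulsoup4
```

In a later cell, import and use the package as usual; when the session ends, the library is no longer available.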
Summary of supported library types
| Library type | Environment library management | In-line installation |
|---|---|---|
| Python Public (PyPI & Conda) | Supported | Supported |
| Python Custom (.whl) | Supported | Supported |
| R Public (CRAN) | Not supported | Supported |
| R Custom (.tar.gz) | Supported | Supported |
| Jar | Supported as custom library | Not supported |
Important
We currently have the following limitations on .jar libraries:
- If you upload a .jar file that contains a different version of a built-in library, the built-in version is no longer effective; only the new .jar is effective for your Spark sessions.
- %%configure magic commands are currently not fully supported on Fabric. Don't use them to bring .jar files to your notebook session.
In-line installation
In-line commands support Python libraries and R libraries.
Python in-line installation
Important
The Python interpreter is restarted to apply the library changes, and any variables defined before the command cell runs are lost. Therefore, we strongly recommend putting all commands for adding, deleting, or updating Python packages at the beginning of your notebook. Use %pip rather than !pip. !pip is an IPython built-in shell command, which has the following limitations:
- !pip only installs the package on the driver node, not on the executor nodes.
- Packages installed through !pip don't take effect when they conflict with built-in packages or when they're already imported in the notebook.
%pip handles all of these scenarios. Libraries installed through %pip are available on both the driver and executor nodes and remain effective even if the library is already imported.
Tip
- The %conda install command usually takes longer than the %pip install command to install new Python libraries, because it checks the full dependencies and resolves conflicts. You may want to use %conda install for more reliability and stability. You can use %pip install if you are sure that the library you want to install doesn't conflict with the preinstalled libraries in the runtime environment.
- You can find all available Python in-line commands and their clarifications in %pip commands and %conda commands.
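Putting the recommendation together, a first notebook cell that manages packages before any other code runs might look like the following sketch; it is the %pip equivalent of the conda commands used in the walkthrough below, and the package names are illustrative only:

```python
# First cell of the notebook: run package-management commands before any other code,
# because the Python interpreter restarts and variables defined earlier are lost.
# Package names are illustrative; replace them with the libraries you actually need.
%pip install altair          # install a library for this session (driver and executors)
%pip install vega_datasets   # install another package used later in the notebook
```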
Manage Python public libraries through in-line installation
In this example, we show you how to use in-line commands to manage libraries. Suppose you want to use altair, a powerful visualization library for Python, for one-time data exploration, and the library isn't installed in your workspace. The following example uses conda commands to illustrate the steps.
You can use in-line commands to enable altair on your notebook session without affecting other sessions of the notebook or other items.
Run the following commands in a notebook code cell to install the altair library and vega_datasets, which contains semantic models you can use for visualization:
%conda install altair          # install latest version through conda command
%conda install vega_datasets   # install latest version through conda command
The cell output indicates the result of the installation.
Import the package and semantic model by running the following code in another notebook cell:
import altair as alt
from vega_datasets import data
Now you can play around with the session-scoped altair library:
# load a simple dataset as a pandas DataFrame
cars = data.cars()

alt.Chart(cars).mark_point().encode(
    x='Horsepower',
    y='Miles_per_Gallon',
    color='Origin',
).interactive()
Manage Python custom libraries through in-line installation
You can upload your Python custom libraries to the File folder of the lakehouse attached to your notebook. Go to your lakehouse, select the … icon on the File folder, and upload the custom library.
After uploading, you can use the following command to install the custom library to your notebook session:
# install the .whl through pip command
%pip install /lakehouse/default/Files/wheel_file_name.whl
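After the wheel installs into the session, you can confirm the install and import the package like any other. This is a minimal sketch with hypothetical names; my_custom_package and my_custom_module stand in for whatever your .whl actually provides:

```python
# Confirm the session-scoped install, then import the package.
# "my_custom_package" and "my_custom_module" are hypothetical placeholder names.
%pip show my_custom_package

import my_custom_module
```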
R in-line installation
Fabric supports install.packages(), remove.packages() and devtools:: commands to manage R libraries.
Tip
Find all available R in-line commands and clarifications in install.packages command, remove.packages command, and devtools commands.
Manage R public libraries through in-line installation
Follow this example to walk through the steps of installing an R public library:
To install an R feed library:
Switch the working language to “SparkR(R)” in the notebook ribbon.
Install the caesar library by running the following command in a notebook cell.
install.packages("caesar")
Now you can play around with the session-scoped caesar library with a Spark job.
library(SparkR)
sparkR.session()

hello <- function(x) {
  library(caesar)
  caesar(x)
}

spark.lapply(c("hello world", "good morning", "good evening"), hello)