Spark library management
Applies to: SQL Server 2019 (15.x)
Important
The Microsoft SQL Server 2019 Big Data Clusters add-on will be retired. Support for SQL Server 2019 Big Data Clusters will end on February 28, 2025. All existing users of SQL Server 2019 with Software Assurance will be fully supported on the platform and the software will continue to be maintained through SQL Server cumulative updates until that time. For more information, see the announcement blog post and Big data options on the Microsoft SQL Server platform.
This article provides guidance on how to import and install packages for a Spark session through session and notebook configurations.
Built-in tools
Scala Spark (Scala 2.12) and Hadoop base packages.
PySpark (Python 3.8). Pandas, Sklearn, Numpy, and other data processing and machine learning packages.
MRO 3.5.2 packages. Sparklyr and SparkR for R Spark workloads.
Install packages from a Maven repository onto the Spark cluster at runtime
Maven packages can be installed onto your Spark cluster using notebook cell configuration at the start of your Spark session. Before starting a Spark session in Azure Data Studio, run the following code:
%%configure -f \
{"conf": {"spark.jars.packages": "com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.1"}}
Multiple packages and additional Spark configurations
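The following sample notebook cell defines multiple packages as a comma-separated list in spark.jars.packages and adds an extra Maven repository through spark.jars.repositories.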
%%configure -f \
{
"conf": {
"spark.jars.packages": "com.microsoft.azure:synapseml_2.12:0.9.4,com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.1",
"spark.jars.repositories":"https://mmlspark.azureedge.net/maven"
}
}
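As the package list grows, the comma-separated value of spark.jars.packages becomes awkward to edit by hand. The following Python sketch assembles the same configuration body from a list of Maven coordinates. The helper name build_spark_conf is hypothetical and is not part of Spark or Azure Data Studio; it assumes every coordinate uses the groupId:artifactId:version form shown above.

```python
import json


def build_spark_conf(packages, repositories=None):
    """Return the JSON body of a %%configure cell as a string."""
    for coordinate in packages:
        parts = coordinate.split(":")
        # The coordinates in this article have three non-empty parts.
        if len(parts) != 3 or not all(parts):
            raise ValueError(
                f"expected groupId:artifactId:version, got {coordinate!r}"
            )
    # Spark expects both settings as comma-separated strings.
    conf = {"spark.jars.packages": ",".join(packages)}
    if repositories:
        conf["spark.jars.repositories"] = ",".join(repositories)
    return json.dumps({"conf": conf}, indent=2)


body = build_spark_conf(
    [
        "com.microsoft.azure:synapseml_2.12:0.9.4",
        "com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.1",
    ],
    repositories=["https://mmlspark.azureedge.net/maven"],
)
print(body)
```

The printed JSON can be pasted under a %%configure -f line in a notebook cell.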
Install Python packages in PySpark at runtime
Session-level and job-level package management guarantees library consistency and isolation. The configuration is a standard Spark library configuration that can be applied to Livy sessions. azdata spark supports these configurations. The examples below are presented as Azure Data Studio notebook configure cells that must be run after attaching to a cluster with the PySpark kernel.
If the "spark.pyspark.virtualenv.enabled" : "true" configuration is not set, the session uses the cluster's default Python and its installed libraries.
Session/Job configuration with requirements.txt
Specify the path to a requirements.txt file in HDFS to use as a reference for packages to install.
%%configure -f \
{
"conf": {
"spark.pyspark.virtualenv.enabled" : "true",
"spark.pyspark.virtualenv.python_version": "3.8",
"spark.pyspark.virtualenv.requirements" : "hdfs:///user/project-A/requirements.txt"
}
}
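The requirements file referenced above uses the ordinary pip format, with one package per line and an optional pinned version. As a minimal sketch, the following Python snippet renders such a file from a dictionary of pins. The helper name render_requirements and the example pins are illustrative assumptions, not part of the product.

```python
def render_requirements(pins):
    """Render {package: version or None} as requirements.txt text, sorted by name."""
    lines = []
    for name in sorted(pins):
        version = pins[name]
        # A pinned package becomes "name==version"; an unpinned one is just "name".
        lines.append(f"{name}=={version}" if version else name)
    return "\n".join(lines) + "\n"


text = render_requirements({"numpy": "1.11.0", "xgboost": None})
print(text, end="")
```

Save the output as requirements.txt and copy it to the HDFS location named in the configuration, for example with hdfs dfs -put if a Hadoop client is available.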
Session/Job configuration with different Python versions
Create a conda virtualenv without a requirements file and dynamically add packages during the Spark session.
%%configure -f \
{
"conf": {
"spark.pyspark.virtualenv.enabled" : "true",
"spark.pyspark.virtualenv.python_version": "3.7"
}
}
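After the session starts, it can be useful to confirm that the interpreter matches the requested python_version. The following sketch uses only the Python standard library; the function name python_matches is hypothetical, and the check covers only the process that runs the notebook code, not the executors.

```python
import sys


def python_matches(requested):
    """Return True if the running interpreter matches a 'major.minor' string."""
    major, minor = (int(part) for part in requested.split("."))
    return (sys.version_info.major, sys.version_info.minor) == (major, minor)


current = f"{sys.version_info.major}.{sys.version_info.minor}"
print("Python version:", current)
print("Executable:", sys.executable)
print("Matches 3.7:", python_matches("3.7"))
```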
Library installation
Call sc.install_packages to install libraries dynamically in your session. Libraries are installed on the driver and on all executor nodes.
sc.install_packages("numpy==1.11.0")
import numpy as np
It is also possible to install multiple libraries in the same command by passing a list.
sc.install_packages(["numpy==1.11.0", "xgboost"])
import numpy as np
import xgboost as xgb
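To check which version of a library the session actually resolved, a small helper such as the one below can be run after installation. This is a sketch under stated assumptions: module_version is a hypothetical name, and it relies on the module exposing a __version__ attribute, which numpy and xgboost do but not every package does.

```python
import importlib


def module_version(name):
    """Import a module by name and return its __version__, or None if unavailable."""
    try:
        module = importlib.import_module(name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


# Check the process that runs the notebook code.
print("numpy:", module_version("numpy"))

# Assumption: `sc` is the SparkContext provided by the PySpark kernel.
# Uncomment in a live session to collect the distinct versions seen by executors:
# sc.parallelize(range(8), 8).map(lambda _: module_version("numpy")).distinct().collect()
```

A single distinct value from the executor check suggests that all sampled executors resolved the same version.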
Import .jar from HDFS for use at runtime
Import a .jar file at runtime through Azure Data Studio notebook cell configuration.
%%configure -f
{"conf": {"spark.jars": "/jar/mycodeJar.jar"}}
Next steps
For more information on SQL Server Big Data Clusters and related scenarios, see SQL Server Big Data Clusters.