Spark library management
Applies to:
SQL Server 2019 (15.x)
Important
The Microsoft SQL Server 2019 Big Data Clusters add-on will be retired. Support for SQL Server 2019 Big Data Clusters will end on February 28, 2025. All existing users of SQL Server 2019 with Software Assurance will be fully supported on the platform and the software will continue to be maintained through SQL Server cumulative updates until that time. For more information, see the announcement blog post and Big data options on the Microsoft SQL Server platform.
This article provides guidance on how to import and install packages for a Spark session through session and notebook configurations.
Scala Spark (Scala 2.12) and Hadoop base packages.
PySpark (Python 3.8). pandas, scikit-learn, NumPy, and other data processing and machine learning packages.
MRO 3.5.2 packages. Sparklyr and SparkR for R Spark workloads.
Maven packages can be installed onto your Spark cluster using notebook cell configuration at the start of your Spark session. Before starting a Spark session in Azure Data Studio, run the following code:
%%configure -f \
{"conf": {"spark.jars.packages": "com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.1"}}
In the following sample notebook cell, multiple packages are defined.
%%configure -f \
{
"conf": {
"spark.jars.packages": "com.microsoft.azure:synapseml_2.12:0.9.4,com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.1",
"spark.jars.repositories":"https://mmlspark.azureedge.net/maven"
}
}
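The value of spark.jars.packages is a comma-separated list of Maven coordinates in the form groupId:artifactId:version, and spark.jars.repositories is a comma-separated list of additional repository URLs. As a minimal sketch, the cell text can be generated from Python lists so that a malformed coordinate is caught before the session starts. The helper name build_configure_cell is hypothetical and is not part of Spark, Livy, or Azure Data Studio.

```python
import json


def build_configure_cell(packages, repositories=None):
    """Return the text of a %%configure cell for the given Maven coordinates.

    Hypothetical helper for illustration; it only formats text.
    """
    for coordinate in packages:
        # Each coordinate must look like groupId:artifactId:version.
        parts = coordinate.split(":")
        if len(parts) != 3 or not all(parts):
            raise ValueError(
                f"expected groupId:artifactId:version, got {coordinate!r}"
            )

    conf = {"spark.jars.packages": ",".join(packages)}
    if repositories:
        conf["spark.jars.repositories"] = ",".join(repositories)

    return "%%configure -f\n" + json.dumps({"conf": conf}, indent=4)


# Same packages and repository as the sample cell above.
cell = build_configure_cell(
    [
        "com.microsoft.azure:synapseml_2.12:0.9.4",
        "com.microsoft.azure:azure-eventhubs-spark_2.12:2.3.1",
    ],
    repositories=["https://mmlspark.azureedge.net/maven"],
)
print(cell)
```

The printed text can be pasted into the first cell of a notebook before the Spark session starts.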
Session-level and job-level package management guarantees library consistency and isolation. The configuration is a standard Spark library configuration that can be applied to Livy sessions, and azdata spark supports these configurations. The examples below are presented as Azure Data Studio notebook configure cells that must be run after attaching to a cluster with the PySpark kernel.
If the "spark.pyspark.virtualenv.enabled" : "true" configuration is not set, the session uses the cluster's default Python and its installed libraries.
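Because these settings are ordinary Spark configuration applied to a Livy session, the same key-value pairs can in principle be sent when a session is created through the Livy REST API, whose POST /sessions request accepts kind and conf fields. The following is a sketch only: it builds the JSON request body and sends nothing. The helper name build_livy_session_request is hypothetical, and the Livy endpoint URL and authentication for a given cluster are deployment-specific and not shown here.

```python
import json


def build_livy_session_request(conf, kind="pyspark"):
    """Build the JSON-serializable body for a Livy POST /sessions request.

    Hypothetical helper for illustration; it does not contact a cluster.
    """
    return {"kind": kind, "conf": dict(conf)}


# Virtual environment settings of the kind used in this article.
virtualenv_conf = {
    "spark.pyspark.virtualenv.enabled": "true",
    "spark.pyspark.virtualenv.python_version": "3.8",
}

body = json.dumps(build_livy_session_request(virtualenv_conf), indent=4)
print(body)
```

Keeping the configuration in a dictionary makes it straightforward to reuse the same settings across notebook cells and scripted session creation.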
Specify the path to a requirements.txt file in HDFS to use as a reference for packages to install.
%%configure -f \
{
"conf": {
"spark.pyspark.virtualenv.enabled" : "true",
"spark.pyspark.virtualenv.python_version": "3.8",
"spark.pyspark.virtualenv.requirements" : "hdfs://user/project-A/requirements.txt"
}
}
Create a conda virtualenv without a requirements file and dynamically add packages during the Spark session.
%%configure -f \
{
"conf": {
"spark.pyspark.virtualenv.enabled" : "true",
"spark.pyspark.virtualenv.python_version": "3.7"
}
}
Call sc.install_packages to install libraries dynamically in your session. Libraries are installed on the driver and across all executor nodes.
sc.install_packages("numpy==1.11.0")
import numpy as np
It is also possible to install multiple libraries in the same command by passing a list.
sc.install_packages(["numpy==1.11.0", "xgboost"])
import numpy as np
import xgboost as xgb
Import a .jar file at runtime through Azure Data Studio notebook cell configuration.
%%configure -f
{"conf": {"spark.jars": "/jar/mycodeJar.jar"}}
For more information on SQL Server Big Data Clusters and related scenarios, see SQL Server Big Data Clusters.