Machine learning with Apache Spark

2025-01-01

Apache Spark in Azure Synapse Analytics enables machine learning with big data, providing the ability to obtain valuable insight from large amounts of structured, unstructured, and fast-moving data.

This section includes an overview and tutorials for machine learning workflows, including exploratory data analysis, feature engineering, model training, model scoring, and deployment.

Synapse Runtime

The Synapse Runtime is a curated environment optimized for data science and machine learning. It offers a range of popular open-source libraries and includes the Azure Machine Learning SDK by default, along with many external libraries such as PyTorch, Scikit-learn, and XGBoost.

Learn more about the available libraries and related versions by viewing the published Azure Synapse Analytics runtime.

Exploratory data analysis

When using Apache Spark in Azure Synapse Analytics, there are various built-in options to help you visualize your data, including Synapse notebook chart options, access to popular open-source libraries like Seaborn and Matplotlib, as well as integration with Synapse SQL and Power BI.

Learn more about the data visualization and data analysis options by viewing the article on how to visualize data using Azure Synapse Notebooks.

Feature engineering

By default, the Synapse Runtime includes a set of libraries that are commonly used for feature engineering. For large datasets, you can use Spark SQL, MLlib, and Koalas for feature engineering. For smaller datasets, third-party libraries like NumPy, Pandas, and Scikit-learn also provide useful methods for these scenarios.

Train models

There are several options for training machine learning models with Apache Spark in Azure Synapse Analytics: Apache Spark MLlib, Azure Machine Learning, and various other open-source libraries.

Learn more about the machine learning capabilities by viewing the article on how to train models in Azure Synapse Analytics.

SparkML and MLlib

Spark's in-memory distributed computation capabilities make it a good choice for the iterative algorithms used in machine learning and graph computations. spark.ml provides a uniform set of high-level APIs that help users create and tune machine learning pipelines. To learn more about spark.ml, you can visit the Apache Spark ML programming guide.

Open-source libraries

Every Apache Spark pool in Azure Synapse Analytics comes with a set of popular machine learning libraries preloaded. The libraries included by default are:

Scikit-learn is one of the most popular single-node machine learning libraries for classical ML algorithms. Scikit-learn supports most of the supervised and unsupervised learning algorithms and can also be used for data-mining and data-analysis.
XGBoost is a popular machine learning library that contains optimized algorithms for training gradient-boosted decision trees and random forests.
PyTorch & TensorFlow are powerful Python deep learning libraries. Within an Apache Spark pool in Azure Synapse Analytics, you can use these libraries to build single-machine models by setting the number of executors on your pool to zero. Even though Apache Spark is not functional under this configuration, it is a simple and cost-effective way to create single-machine models.
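As a small example of the single-node style of training these libraries support, the sketch below fits a Scikit-learn random forest on the library's bundled Iris dataset. The dataset choice, split ratio, and hyperparameters are assumptions for illustration only. Nothing here depends on Spark, so it runs entirely on the driver node.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Small in-memory dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows for evaluation; stratify to keep class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Train on a single machine; fixed random_state for reproducibility.
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

# Mean accuracy on the held-out rows.
accuracy = clf.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```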

Track model development

MLflow is an open-source library for managing the life cycle of your machine learning experiments. MLflow Tracking is a component of MLflow that logs and tracks your training run metrics and model artifacts. To learn more about how you can use MLflow Tracking through Azure Synapse Analytics and Azure Machine Learning, visit this tutorial on how to use MLflow.

Model scoring

Model scoring, or inferencing, is the phase where a model is used to make predictions. For model scoring with SparkML or MLlib, you can use the native Spark methods to perform inferencing directly on a Spark DataFrame. For other open-source libraries and model types, you can create a Spark UDF to scale out inference on large datasets. For smaller datasets, you can use the native model inference methods provided by the library.

Register and serve models

Registering a model allows you to store, version, and track metadata about models in your workspace. After you have finished training your model, you can register your model to the Azure Machine Learning model registry. Once registered, ONNX models can also be used to enrich the data stored in dedicated SQL pools.

Next steps

To get started with machine learning in Azure Synapse Analytics, be sure to check out the following tutorials:
