Apache Spark in Azure Synapse Analytics enables machine learning with big data, providing the ability to obtain valuable insight from large amounts of structured, unstructured, and fast-moving data. There are several options for training machine learning models with Apache Spark in Azure Synapse Analytics: Apache Spark MLlib, Azure Machine Learning, and various other open-source libraries.
Note
The Preview for Azure Synapse GPU-enabled pools has now been deprecated.
GPU-enabled Apache Spark pools
To simplify the process of creating and managing pools, Azure Synapse takes care of pre-installing low-level libraries and setting up all the complex networking requirements between compute nodes. This integration allows users to get started with GPU-accelerated pools within just a few minutes.
Note
- GPU-accelerated pools can be created in workspaces located in East US, Australia East, and North Europe.
- GPU-accelerated pools are only available with the Apache Spark 3.1 (deprecated) and 3.2 (deprecated) runtimes.
- You might need to request a limit increase in order to create GPU-enabled clusters.
GPU ML Environment
Azure Synapse Analytics provides built-in support for deep learning infrastructure. The Azure Synapse Analytics runtimes for Apache Spark 3 include support for the most common deep learning libraries, such as TensorFlow and PyTorch. The runtimes also include supporting libraries such as Petastorm and Horovod, which are commonly used for distributed training.
TensorFlow
TensorFlow is an open source machine learning framework for all developers. It is used for implementing machine learning and deep learning applications.
For more information about TensorFlow, you can visit the TensorFlow API documentation.
PyTorch
PyTorch is an optimized tensor library for deep learning using GPUs and CPUs.
For more information about PyTorch, you can visit the PyTorch documentation.
Horovod
Horovod is a distributed deep learning training framework for TensorFlow, Keras, and PyTorch. Horovod was developed to make distributed deep learning fast and easy to use. With this framework, an existing training script can be scaled up to run on hundreds of GPUs in just a few lines of code. In addition, Horovod can run on top of Apache Spark, making it possible to unify data processing and model training into a single pipeline.
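The idea behind this kind of scaling can be illustrated without a cluster. The following is a minimal sketch in plain Python, not Horovod's API: it simulates synchronous data-parallel training, where each worker computes a gradient on its own data shard and the gradients are averaged before every update (the role an allreduce step plays in Horovod). The function names `allreduce_average`, `local_gradient`, and `train`, the toy model `y = w * x`, and the data are all assumptions made for illustration.

```python
def allreduce_average(worker_grads):
    """Average per-worker gradient vectors element-wise.

    In a real cluster every worker would end up holding this same result.
    """
    n = len(worker_grads)
    return [sum(vals) / n for vals in zip(*worker_grads)]


def local_gradient(w, shard):
    """Gradient of the mean squared error for y = w * x on one data shard."""
    return [sum(2 * (w * x - y) * x for x, y in shard) / len(shard)]


def train(shards, w=0.0, lr=0.05, steps=200):
    """Run synchronous data-parallel gradient descent over simulated workers."""
    for _ in range(steps):
        # Each simulated worker computes a gradient on its own shard only.
        grads = [local_gradient(w, shard) for shard in shards]
        # All workers then agree on the averaged gradient before updating.
        (g,) = allreduce_average(grads)
        w -= lr * g
    return w


# Toy dataset generated from y = 3 * x, split evenly across two "workers".
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
shards = [data[:2], data[2:]]
w = train(shards)
print(f"learned weight: {w:.4f}")
```

Because the shards are the same size, the averaged gradient equals the gradient over the full dataset, so the simulated workers follow the same trajectory as single-node training. With Horovod, the averaging is performed across real processes and GPUs rather than inside a Python list comprehension.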
To learn more about how to run distributed training jobs in Azure Synapse Analytics, you can visit the following tutorials:
- Tutorial: Distributed training with Horovod and PyTorch
- Tutorial: Distributed training with Horovod and TensorFlow
For more information about Horovod, you can visit the Horovod documentation.
Petastorm
Petastorm is an open-source data access library that enables single-node or distributed training of deep learning models. This library enables training directly from datasets in Apache Parquet format and from datasets that have already been loaded as an Apache Spark DataFrame. Petastorm supports popular training frameworks such as TensorFlow and PyTorch.
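As a rough sketch of the DataFrame path described above, the helper below uses Petastorm's Spark converter to feed an existing DataFrame to a PyTorch-style data loader. The helper name `batches_from_dataframe` and its parameters are hypothetical, and `cache_dir_url` is a placeholder that you would point at storage your pool can write to. The sketch assumes that Petastorm and PyTorch are installed on the pool and that `spark` is an active `SparkSession`.

```python
def batches_from_dataframe(spark, df, cache_dir_url, batch_size=32):
    """Yield batches from a Spark DataFrame through Petastorm (illustrative helper)."""
    # Imported inside the function so the module still loads on machines
    # where Petastorm is not installed.
    from petastorm.spark import SparkDatasetConverter, make_spark_converter

    # Petastorm materializes the DataFrame as Parquet files under this location.
    spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, cache_dir_url)
    converter = make_spark_converter(df)
    try:
        # num_epochs=1 makes the loader stop after one pass over the data.
        with converter.make_torch_dataloader(
            batch_size=batch_size, num_epochs=1
        ) as loader:
            for batch in loader:
                yield batch
    finally:
        # Remove the cached Parquet copy once iteration ends.
        converter.delete()
```

A training loop could then iterate over `batches_from_dataframe(spark, df, "file:///tmp/petastorm_cache")`, where the cache URL is only an example value. The same converter also exposes `make_tf_dataset` for TensorFlow.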
For more information about Petastorm, you can visit the Petastorm GitHub page or the Petastorm API documentation.
Next steps
This article provides an overview of the options for training machine learning models within Apache Spark pools in Azure Synapse Analytics. You can learn more about model training by following the tutorials below:
- Run SparkML experiments: Apache SparkML Tutorial
- Accelerate ETL workloads with RAPIDS: Apache Spark Rapids