This reference architecture shows how to conduct distributed training of deep learning models across clusters of GPU-enabled VMs. The scenario is image classification, but the solution can be generalized to other deep learning scenarios such as segmentation or object detection.
A reference implementation for this architecture is available on GitHub.
Download a Visio file of this architecture.
This architecture consists of the following services:
Azure Machine Learning Compute plays the central role in this architecture by scaling resources up and down according to need. Azure Machine Learning Compute is a service that helps provision and manage clusters of VMs, schedule jobs, gather results, scale resources, and handle failures. It supports GPU-enabled VMs for deep learning workloads.
Standard Blob Storage is used to store the logs and results. Premium Blob Storage is used to store the training data and is mounted in the nodes of the training cluster by using blobfuse. The Premium tier of Blob Storage offers better performance than the Standard tier and is recommended for distributed training scenarios. When mounted by using blobfuse, during the first epoch, the training data is downloaded to the local disks of the training cluster and cached. For every subsequent epoch, the data is read from the local disks, which is the most performant option.
Azure Container Registry is used to store the Docker image that Azure Machine Learning Compute uses to run the training.
- Azure Machine Learning is an open platform for managing the development and deployment of machine-learning models at scale. The platform supports commonly used open frameworks and offers automated featurization and algorithm selection. You can use Machine Learning to deploy models to various targets, including Azure Container Instances.
- Azure Blob Storage is a service that's part of Azure Storage. Blob Storage offers optimized cloud object storage for large amounts of unstructured data.
- Container Registry is a cloud-based, private registry service. You can use Container Registry to store and manage private Docker container images and related artifacts.
Scenario: Classifying images is a widely applied technique in computer vision, often tackled by training a convolutional neural network (CNN). For particularly large models with large datasets, the training process can take weeks or months on a single GPU. In some situations, the models are so large that it's not possible to fit reasonable batch sizes onto the GPU. Using distributed training in these situations can shorten the training time.
In this specific scenario, a ResNet50 CNN model is trained using Horovod on the ImageNet dataset and on synthetic data. The reference implementation shows how to accomplish this task using TensorFlow.
There are several ways to train a deep learning model in a distributed fashion, including data-parallel and model-parallel approaches that are based on synchronous or asynchronous updates. Currently the most common scenario is data-parallel training with synchronous updates. This approach is the easiest to implement and is sufficient for most use cases.
In data-parallel distributed training with synchronous updates, the model is replicated across n hardware devices. A mini-batch of training samples is divided into n micro-batches. Each device performs forward and backward passes for a micro-batch. When a device finishes the process, it shares the updates with the other devices. These values are used to calculate the updated weights of the entire mini-batch, and the weights are synchronized across the models. This scenario is covered in the associated GitHub repository.
This architecture can also be used for model-parallel and asynchronous updates. In model-parallel distributed training, the model is divided across n hardware devices, with each device holding a part of the model. In the simplest implementation, each device holds a layer of the network, and information is passed between devices during the forward and backward passes. Larger neural networks can be trained this way, but at the cost of performance, because devices are constantly waiting for each other to complete either the forward or backward pass. Some advanced techniques try to partially alleviate this issue by using synthetic gradients.
The steps for training are:
- Create scripts that run on the cluster and train your model.
- Write training data to Blob Storage.
- Create a Machine Learning workspace. This step also creates an instance of Container Registry to host your Docker images.
- Create a Machine Learning GPU-enabled cluster.
- Submit training jobs. For each job with unique dependencies, a new Docker image is built and pushed to your container registry. During execution, the appropriate Docker image runs and executes your script.
- All the results and logs are written to Blob Storage.
Training cluster considerations
Azure provides several GPU-enabled VM types that are suitable for training deep learning models. They range in price and speed from low to high as follows:
|Azure VM series||NVIDIA GPU|
|NDv2||8x V100 (NVLink)|
|ND A100 v4||8x A100 (NVLink)|
We recommend scaling up your training before scaling out. For example, try a single V100 before trying a cluster of K80s. Similarly, consider using a single NDv2 instead of eight NCsv3 VMs.
The following graph shows the performance differences for different GPU types based on benchmarking tests carried out using TensorFlow and Horovod. The graph shows throughput of 32 GPU clusters across various models, on different GPU types and MPI versions. Models were implemented in TensorFlow 1.9
Each VM series shown in the previous table includes a configuration with InfiniBand. Use the InfiniBand configurations when you run distributed training, for faster communication between nodes. InfiniBand also increases the scaling efficiency of the training for the frameworks that can take advantage of it. For details, see the Infiniband benchmark comparison.
These considerations implement the pillars of the Azure Well-Architected Framework, which is a set of guiding tenets that you can use to improve the quality of a workload. For more information, see Microsoft Azure Well-Architected Framework.
When you train deep learning models, an often overlooked aspect is where to store the training data. If the storage is too slow to keep up with the demands of the GPUs, training performance can degrade.
Azure Machine Learning Compute supports many storage options. For best performance, download the data locally to each node. However, this process can be cumbersome, because you have to download the data to each node from Blob Storage. With the ImageNet dataset, this process can take a considerable amount of time. By default, Machine Learning mounts storage so that it caches the data locally. As a result, in practice, after the first epoch, the data is read from local storage. Combined with Premium Blob Storage, this arrangement offers a good compromise between ease of use and performance.
Although Azure Machine Learning Compute can mount Standard tier Blob Storage using the blobfuse adapter, we don't recommend using the Standard tier for distributed training, because the performance typically isn't good enough to handle the necessary throughput. Use Premium tier as storage for training data, as shown earlier in the architecture diagram. For a blog post with a throughput and latency comparison between the two tiers, see Premium Block Blob Storage - a new level of performance.
Whenever a Machine Learning workspace is provisioned, a set of dependent resources—Blob Storage, Key Vault, Container Registry, and Application Insights—is also provisioned. Alternatively, you can use existing Azure resources and associate them with the new Machine Learning workspace during its creation.
By default, Basic tier Container Registry is provisioned. For large-scale deep learning, we recommend that you customize your workspace to use Premium tier Container Registry. It offers significantly higher bandwidth that allows you to quickly pull Docker images across nodes of your training cluster.
With large datasets, it's often advisable to use data formats such as TFRecords or Petastorm that provide better I/O performance than multiple small image files.
Security provides assurances against deliberate attacks and the abuse of your valuable data and systems. For more information, see Overview of the security pillar.
Use a High Business Impact-enabled workspace
In scenarios that use sensitive data, you should consider designating a Machine Learning workspace as High Business Impact (HBI) by setting an hbi_workspace flag to true when creating it. An HBI-enabled workspace, among others, encrypts local scratch disks of compute clusters, enables IP filtering, and reduces the amount of diagnostic data that Microsoft collects. For more information, see Data encryption with Azure Machine Learning.
Encrypt data at rest and in motion
Encrypt sensitive data at rest—that is, in the blob storage. Each time data moves from one location to the other, use SSL to secure the data transfer. For more information, see Azure Storage security guide.
Secure data in a virtual network
For production deployments, consider deploying the Machine Learning cluster into a subnet of a virtual network that you specify. With this setup, the compute nodes in the cluster can communicate securely with other virtual machines or with an on-premises network. You can also use service or private endpoints for all associated resources to grant access from a virtual network.
Cost optimization is about looking at ways to reduce unnecessary expenses and improve operational efficiencies. For more information, see Overview of the cost optimization pillar.
Use the Azure pricing calculator to estimate the cost of running your deep learning workload. For cost planning and management considerations that are specific to Machine Learning, see Plan to manage costs for Azure Machine Learning. For more information, see Overview of the cost optimization pillar.
Premium Blob Storage
Premium Blob Storage has a high data storage cost, but the transaction cost is lower than the cost of storing data in the Hot tier of Standard Blob Storage. So Premium Blob Storage can be less expensive for workloads with high transaction rates. For more information, see Azure Blob Storage pricing.
Container Registry offers Basic, Standard and Premium tiers. Choose a tier depending on the storage you need. Choose Premium if you need geo replication or enhanced throughput for Docker pulls across concurrent nodes. In addition, standard networking charges apply. For more information, see Azure Container Registry pricing.
Azure Machine Learning Compute
In this architecture, Azure Machine Learning Compute is likely the main cost driver. The implementation needs a cluster of GPU compute nodes. The price of those nodes is determined by their number and the VM size that you select. For more information on the VM sizes that include GPUs, see GPU-optimized virtual machine sizes and Azure Virtual Machines Pricing.
Typically, deep learning workloads track the progress after every epoch or every few epochs. This practice limits the impact of unexpected interruptions to the training. You can pair this practice with the use of low-priority VMs for Machine Learning compute clusters. Low-priority VMs use excess Azure capacity at significantly reduced rates, but they can be preempted if capacity demands increase.
Operational excellence covers the operations processes that deploy an application and keep it running in production. For more information, see Overview of the operational excellence pillar.
While running your job, it's important to monitor the progress and make sure that things are working as expected. However, it can be a challenge to monitor across a cluster of active nodes.
Machine Learning offers many ways to instrument your experiments. The stdout and stderr streams from your scripts are automatically logged. These logs are automatically synced to your workspace blob storage. You can either view these files through the Azure portal, or download or stream them using the Python SDK or Machine Learning CLI. If you log your experiments by using Tensorboard, these logs are automatically synced. You can access them directly or use the Machine Learning SDK to stream them to a Tensorboard session.
Performance efficiency is the ability of your workload to scale to meet the demands placed on it by users in an efficient manner. For more information, see Performance efficiency pillar overview.
The scaling efficiency of distributed training is always less than 100 percent due to network overhead—syncing the entire model between devices becomes a bottleneck. Therefore, distributed training is best suited for:
- Large models that can't be trained by using a reasonable batch size on a single GPU.
- Problems that can't be addressed by distributing the model in a simple, parallel way.
Distributed training isn't recommended for running hyperparameter searches. The scaling efficiency affects performance and makes a distributed approach less efficient than training multiple model configurations separately.
One way to increase scaling efficiency is to increase the batch size. But make this adjustment carefully. Increasing the batch size without adjusting the other parameters can impair the model's final performance.
Deploy this scenario
The reference implementation of this architecture is available on GitHub. Follow the steps described there to conduct distributed training of deep learning models across clusters of GPU-enabled VMs.
This article is maintained by Microsoft. It was originally written by the following contributors.
- Ilia Karmanov | Senior Applied Scientist
- Mathew Salvaris | Principal Data Scientist Lead
To see non-public LinkedIn profiles, sign in to LinkedIn.
- Azure Machine Learning Cheat Sheet - distributed GPU training
- Azure Machine Learning - distributed training examples
- What is Azure Machine Learning?
- Introduction to Container registries in Azure
- Introduction to Azure Blob Storage
- Train compute-intensive models with Azure Machine Learning
- Train and evaluate deep learning models
The output from this architecture is a trained model that is saved to a blob storage. You can operationalize this model for either real-time scoring or batch scoring. For more information, see the following reference architectures:
- Real-time scoring of Python scikit-learn and deep learning models on Azure
- Batch scoring on Azure for deep learning models
For architectures that involve distributed training or deep learning, see the following resources: