NEW REFERENCE ARCHITECTURE: Distributed training of deep learning models on Azure

The AzureCAT blog has moved! Find this blog post on our new blog at the Microsoft Tech Community.


Our sixth AI reference architecture (on the Azure Architecture Center) was authored by Mathew Salvaris of AzureCAT, edited by Nanette Ray, and published by Mike Wasson.

Reference architectures provide a consistent approach and best practices for a given solution. Each architecture includes recommended practices, along with considerations for scalability, availability, manageability, security, and more. This architecture includes a deployable solution as well. The full array of reference architectures is available on the Azure Architecture Center.

Distributed training of deep learning models on Azure

This reference architecture shows how to conduct distributed training of deep learning models across clusters of GPU-enabled virtual machines (VMs). The scenario is image classification, but the solution can be generalized to other deep learning scenarios, such as segmentation and object detection.
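To make "distributed training" a little more concrete, here is a minimal sketch of one common strategy, synchronous data-parallel training, simulated in plain Python. This post does not say which strategy or framework the reference architecture uses, so treat the technique as an assumption. The function names (shard, local_gradient, all_reduce_mean, train) and the tiny linear model are hypothetical stand-ins for a real deep learning model running on GPU VMs.

```python
"""Toy simulation of synchronous data-parallel training (no GPUs, no Azure)."""
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y)


def shard(data: List[Sample], num_workers: int) -> List[List[Sample]]:
    """Split the dataset round-robin so each worker gets its own shard."""
    return [data[i::num_workers] for i in range(num_workers)]


def local_gradient(w: float, b: float, batch: List[Sample]) -> Tuple[float, float]:
    """Mean-squared-error gradient for y ~ w*x + b on one worker's shard."""
    n = len(batch)
    gw = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    gb = sum(2 * (w * x + b - y) for x, y in batch) / n
    return gw, gb


def all_reduce_mean(grads: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Stand-in for an all-reduce: average the gradients from every worker."""
    k = len(grads)
    return sum(g[0] for g in grads) / k, sum(g[1] for g in grads) / k


def train(data: List[Sample], num_workers: int = 4,
          lr: float = 0.05, steps: int = 500) -> Tuple[float, float]:
    shards = shard(data, num_workers)
    w, b = 0.0, 0.0  # every worker starts from an identical copy of the model
    for _ in range(steps):
        # On a real cluster each worker computes this on its own GPU in parallel.
        grads = [local_gradient(w, b, s) for s in shards]
        gw, gb = all_reduce_mean(grads)
        # The same averaged update is applied to every replica, keeping them in sync.
        w, b = w - lr * gw, b - lr * gb
    return w, b


# Synthetic data drawn from y = 3x + 1 (illustrative only).
data = [(x / 10, 3.0 * (x / 10) + 1.0) for x in range(-20, 20)]
w, b = train(data)
print(f"w={w:.4f}, b={b:.4f}")
```

With the synthetic data above, w and b should end up close to 3 and 1. Because every shard has the same size, averaging the per-worker gradients gives the same result as one full-batch gradient, which is why the replicas behave like a single larger model update.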

This architecture consists of the following components:

  • Azure Batch AI plays the central role in this architecture by scaling resources up and down according to need.
  • Blob storage is used to stage the data.
  • Azure Files is used to store the scripts, logs, and the final results from the training.
  • Batch AI file server is a single-node NFS share used in this architecture to store the training data.
  • Docker Hub is used to store the Docker image that Batch AI uses to run the training. Azure Container Registry can also be used.
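
As a rough sketch of how these components fit together, the Python snippet below records each component's role and prints one plausible order in which a training job touches them. The roles are paraphrased from the list above. The dictionary keys, the Component class, the job_plan function, and the step ordering are assumptions made for illustration and are not part of any Azure SDK.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Component:
    name: str
    role: str


# Keys are hypothetical labels; names and roles come from the component list.
COMPONENTS: Dict[str, Component] = {
    "compute": Component("Azure Batch AI", "scales resources up and down and runs the training"),
    "staging": Component("Blob storage", "stages the data"),
    "artifacts": Component("Azure Files", "stores scripts, logs, and final results"),
    "training_data": Component("Batch AI file server", "single-node NFS share for the training data"),
    "image_registry": Component("Docker Hub", "stores the Docker image used to run the training"),
}


def job_plan(components: Dict[str, Component]) -> List[str]:
    """Return one plausible ordering of the data flow (an assumption, not from the post)."""
    c = components
    return [
        f"{c['staging'].name}: stage the raw data",
        f"{c['training_data'].name}: hold the training data on the NFS share",
        f"{c['artifacts'].name}: provide the training scripts",
        f"{c['image_registry'].name}: supply the Docker image",
        f"{c['compute'].name}: scale the cluster and run the training",
        f"{c['artifacts'].name}: receive logs and final results",
    ]


plan = job_plan(COMPONENTS)
for step_number, step in enumerate(plan, start=1):
    print(f"{step_number}. {step}")
```

Swapping Docker Hub for Azure Container Registry, as the list allows, would only change the "image_registry" entry in this sketch.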


Head over to the Azure Architecture Center to learn more about the Distributed training of deep learning models on Azure reference architecture.


See Also

Find additional related AI reference architectures, along with all our other reference architectures, on the Azure Architecture Center.


AzureCAT Guidance

"Hands-on solutions, with our heads in the Cloud!"