Distributed GPU training guide (SDK v2)

APPLIES TO: Python SDK azure-ai-ml v2 (current)

This article describes how to use distributed GPU training code in Azure Machine Learning. You see how to run existing code with tips and examples for PyTorch, DeepSpeed, and TensorFlow. You also learn about accelerating distributed GPU training with InfiniBand.

Tip

More than 90% of the time, you should use distributed data parallelism as your distributed parallelism type.

Prerequisite

Understanding of the basic concepts of distributed GPU training, such as data parallelism, distributed data parallelism, and model parallelism.
The Azure Machine Learning SDK for Python with the azure-ai-ml version 1.5.0 or later package installed.

PyTorch

Azure Machine Learning supports running distributed jobs by using PyTorch's native distributed training capabilities, torch.distributed.

For data parallelism, the official PyTorch guidance advises using DistributedDataParallel (DDP) over DataParallel for both single-node and multinode distributed training. PyTorch also recommends using DistributedDataParallel over the multiprocessing package. Therefore, Azure Machine Learning documentation and examples focus on DistributedDataParallel training.

Process group initialization

The backbone of any distributed training is a group of processes that recognize and can communicate with each other by using a backend. For PyTorch, you create the process group by calling torch.distributed.init_process_group in all distributed processes to collectively form a process group.

torch.distributed.init_process_group(backend='nccl', init_method='env://', ...)

The most common communication backends are mpi, nccl, and gloo. For GPU-based training, use nccl for the best performance.

To run distributed PyTorch on Azure Machine Learning, use the init_method parameter in your training code. This parameter specifies how each process discovers the other processes and how they initialize and verify the process group by using the communication backend. By default, if you don't specify init_method, PyTorch uses the environment variable initialization method env://.

PyTorch looks for the following environment variables for initialization:

MASTER_ADDR: IP address of the machine that hosts the process with rank 0.
MASTER_PORT: A free port on the machine that hosts the process with rank 0.
WORLD_SIZE: The total number of processes. Should be equal to the total number of GPU devices used for distributed training.
RANK: The global rank of the current process. The possible values are 0 to (<world size> - 1).

For more information on process group initialization, see the PyTorch documentation.

Environment variables

Many applications also need the following environment variables:

LOCAL_RANK: The local relative rank of the process within the node. The possible values are 0 to (<# of processes on the node> - 1). This information is useful because many operations such as data preparation only need to be performed once per node, usually on local_rank = 0.
NODE_RANK: The rank of the node for multinode training. The possible values are 0 to (<total # of nodes> - 1).

Run a distributed PyTorch job

You don't need to use a launcher utility like torch.distributed.launch to run a distributed PyTorch job. You can:

Specify the training script and arguments.
Create a command, and specify the type as PyTorch and the process_count_per_instance in the distribution parameter.

The process_count_per_instance corresponds to the total number of processes you want to run for your job, and should typically equal the number of GPUs per node. If you don't specify process_count_per_instance, Azure Machine Learning launches one process per node by default.

Azure Machine Learning sets the MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and NODE_RANK environment variables on each node, and sets the process-level RANK and LOCAL_RANK environment variables.

PyTorch example

from azure.ai.ml import command
from azure.ai.ml.entities import Data
from azure.ai.ml import Input
from azure.ai.ml import Output
from azure.ai.ml.constants import AssetTypes

# === Note on path ===
# can be can be a local path or a cloud path. AzureML supports https://`, `abfss://`, `wasbs://` and `azureml://` URIs.
# Local paths are automatically uploaded to the default datastore in the cloud.
# More details on supported paths: https://docs.microsoft.com/azure/machine-learning/how-to-read-write-data-v2#supported-paths

inputs = {
    "cifar": Input(
        type=AssetTypes.URI_FOLDER, path=returned_job.outputs.cifar.path
    ),  # path="azureml:azureml_stoic_cartoon_wgb3lgvgky_output_data_cifar:1"), #path="azureml://datastores/workspaceblobstore/paths/azureml/stoic_cartoon_wgb3lgvgky/cifar/"),
    "epoch": 10,
    "batchsize": 64,
    "workers": 2,
    "lr": 0.01,
    "momen": 0.9,
    "prtfreq": 200,
    "output": "./outputs",
}

from azure.ai.ml.entities import ResourceConfiguration

job = command(
    code="./src",  # local path where the code is stored
    command="python train.py --data-dir ${{inputs.cifar}} --epochs ${{inputs.epoch}} --batch-size ${{inputs.batchsize}} --workers ${{inputs.workers}} --learning-rate ${{inputs.lr}} --momentum ${{inputs.momen}} --print-freq ${{inputs.prtfreq}} --model-dir ${{inputs.output}}",
    inputs=inputs,
    environment="azureml:AzureML-acpt-pytorch-2.8-cuda12.6@latest",
    instance_count=2,  # In this, only 2 node cluster was created.
    distribution={
        "type": "PyTorch",
        # set process count to the number of gpus per node
        # NC6s_v3 has only 1 GPU
        "process_count_per_instance": 1,
    },
)
job.resources = ResourceConfiguration(
    instance_type="STANDARD_NC4AS_T4_V3", instance_count=2
)  # Serverless compute resources

For the full notebook to run the PyTorch example, see azureml-examples: Distributed training with PyTorch on CIFAR-10.

DeepSpeed

Azure Machine Learning supports DeepSpeed as a top-level feature to run distributed jobs with near linear scalability for increased model size and increased GPU numbers.

You can enable DeepSpeed for running distributed training by using either PyTorch distribution or Message Passing Interface (MPI). Azure Machine Learning supports the DeepSpeed launcher to launch distributed training, and autotuning to get optimal ds configuration.

You can use a curated environment with the latest state-of-the-art technologies, including DeepSpeed, ONNX (Open Neural Network Exchange) Runtime (ORT), Microsoft Collective Communication Library (MSSCCL), and PyTorch, for your DeepSpeed training jobs.

DeepSpeed example

For DeepSpeed training and autotuning examples, see https://github.com/Azure/azureml-examples/cli/jobs/deepspeed.

TensorFlow

If you use native distributed TensorFlow in your training code, such as the TensorFlow 2.x tf.distribute.Strategy API, you can launch the distributed job via Azure Machine Learning by using distribution parameters or the TensorFlowDistribution object.

# create the command
job = command(
    code="./src",  # local path where the code is stored
    command="python main.py --epochs ${{inputs.epochs}} --model-dir ${{inputs.model_dir}}",
    inputs={"epochs": 1, "model_dir": "outputs/keras-model"},
    environment="AzureML-tensorflow-2.16-cuda12@latest",
    compute="cpu-cluster",
    instance_count=2,
    # distribution = {"type": "mpi", "process_count_per_instance": 1},
    # distribution={
    #     "type": "tensorflow",
    #     "parameter_server_count": 1,  # for legacy TensorFlow 1.x
    #     "worker_count": 2,
    #     "added_property": 7,
    # },
    # distribution = {
    #        "type": "pytorch",
    #        "process_count_per_instance": 4,
    #        "additional_prop": {"nested_prop": 3},
    #    },
    display_name="tensorflow-mnist-distributed-example"
    # experiment_name: tensorflow-mnist-distributed-example
    # description: Train a basic neural network with TensorFlow on the MNIST dataset, distributed via TensorFlow.
)

# can also set the distribution in a separate step and using the typed objects instead of a dict
job.distribution = TensorFlowDistribution(worker_count=2)

If your training script uses the parameter server strategy for distributed training, such as for legacy TensorFlow 1.x, you also need to specify the number of parameter servers to use in the job. In the preceding example, you specify "parameter_server_count" : 1 and "worker_count": 2 inside the distribution parameter of the command.

TF_CONFIG

To train on multiple machines in TensorFlow, you use the TF_CONFIG environment variable. For TensorFlow jobs, Azure Machine Learning sets the TF_CONFIG variable correctly for each worker before running your training script.

You can access TF_CONFIG from your training script if you need to by using os.environ['TF_CONFIG'].

The following example sets TF_CONFIG on a chief worker node:

TF_CONFIG='{
    "cluster": {
        "worker": ["host0:2222", "host1:2222"]
    },
    "task": {"type": "worker", "index": 0},
    "environment": "cloud"
}'

TensorFlow example

For the full notebook to run a TensorFlow example, see tensorflow-mnist-distributed-example.

InfiniBand

As you increase the number of virtual machines (VMs) that train a model, the time required to train that model should decrease in linear proportion to the number of training VMs. For instance, if training a model on one virtual machine (VM) takes 100 seconds, training the same model on two VMs should take 50 seconds, training the model on four VMs should take 25 seconds, and so on.

InfiniBand can help you attain this linear scaling by enabling low-latency, GPU-to-GPU communication across nodes in a cluster. InfiniBand requires specialized hardware to operate, such as the Azure VM NC-, ND-, or H-series. These VMs have Remote Direct Memory Access (RDMA)-capable VMs with Single Root I/O Virtualization (SR-IOV) and InfiniBand support.

These VMs communicate over the low-latency and high-bandwidth InfiniBand network, which performs better than Ethernet-based connectivity. SR-IOV for InfiniBand enables near bare-metal performance for any MPI library, as used by many distributed training frameworks and tools like NVIDIA Collective Communications Library (NCCL).

These Stock Keeping Units (SKUs) are intended to meet the needs of computationally intensive, GPU-accelerated machine-learning workloads. For more information, see Accelerating Distributed Training in Azure Machine Learning with SR-IOV.

Typically, only VM SKUs with r in their names, referring to RDMA, contain the required InfiniBand hardware. For instance, the VM SKU Standard_NC24rs_v3 is InfiniBand-enabled, but Standard_NC24s_v3 isn't. The specs for these two SKUs are largely the same except for the InfiniBand capabilities. Both SKUs have 24 cores, 448-GB RAM, 4 GPUs of the same SKU, and so on. For more information about RDMA- and InfiniBand-enabled machine SKUs, see High performance compute.

Note

The older-generation machine SKU Standard_NC24r is RDMA-enabled, but doesn't contain the SR-IOV hardware required for InfiniBand.

If you create an AmlCompute cluster using one of these RDMA-capable, InfiniBand-enabled sizes, the OS image comes with the Mellanox OpenFabrics Enterprise Distribution (OFED) driver required to enable InfiniBand preinstalled and preconfigured.

Feedback

Was this page helpful?

Last updated on 2026-01-15