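Distributed GPU training guide (SDK v2)

Article
06/13/2024

Applies to: Python SDK azure-ai-ml v2 (current)

Learn more about using distributed GPU training code in Azure Machine Learning. This article helps you run your existing distributed training code, and offers tips and examples to follow for each framework: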

Message Passing Interface (MPI)
- Horovod
- Environment variables from Open MPI
PyTorch
TensorFlow
Accelerate GPU training with InfiniBand

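Prerequisites

Review the basic concepts of distributed GPU training, such as data parallelism, distributed data parallelism, and model parallelism.

Tip

If you don't know which type of parallelism to use, more than 90% of the time you should use distributed data parallelism.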

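MPI

Azure Machine Learning offers an MPI job to launch a given number of processes in each node. Azure Machine Learning constructs the full MPI launch command (mpirun) behind the scenes. You can't provide your own full head-node launcher command, such as mpirun or the DeepSpeed launcher.

Tip

The base Docker image used by an Azure Machine Learning MPI job must have an MPI library installed. Open MPI is included in all the Azure Machine Learning GPU base images. When you use a custom Docker image, you're responsible for making sure the image includes an MPI library. Open MPI is recommended, but you can also use a different MPI implementation, such as Intel MPI. Azure Machine Learning also provides curated environments for popular frameworks.

To run distributed training using MPI, follow these steps:

Use an Azure Machine Learning environment with the preferred deep learning framework and MPI. Azure Machine Learning provides curated environments for popular frameworks. Or create a custom environment with the preferred deep learning framework and MPI.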
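Define a command with instance_count, which is the number of nodes. In the distribution settings, process_count_per_instance should equal the number of GPUs per node for per-process launch, or be set to 1 (the default) for per-node launch if the user script is responsible for launching the processes on each node.
Use the distribution parameter of the command to specify settings for MpiDistribution.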

from azure.ai.ml import command, MpiDistribution

job = command(
    code="./src",  # local path where the code is stored
    command="python train.py --epochs ${{inputs.epochs}}",
    inputs={"epochs": 1},
    environment="AzureML-tensorflow-2.12-cuda11@latest",
    compute="gpu-cluster",
    instance_count=2,
    distribution=MpiDistribution(process_count_per_instance=2),
    display_name="tensorflow-mnist-distributed-horovod-example"
    # experiment_name: tensorflow-mnist-distributed-horovod-example
    # description: Train a basic neural network with TensorFlow on the MNIST dataset, distributed via Horovod.
)

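Horovod

Use the MPI job configuration when you use Horovod for distributed training with the deep learning framework.

Make sure your code follows these tips:

The training code is instrumented correctly with Horovod before you add the Azure Machine Learning parts.
Your Azure Machine Learning environment contains Horovod and MPI. The PyTorch and TensorFlow curated GPU environments come preconfigured with Horovod and its dependencies.
Create a command with your desired distribution.

Horovod example

For the full notebook to run the Horovod example, see azureml-examples: Train a basic neural network with distributed MPI on the MNIST dataset using Horovod.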

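Environment variables from Open MPI

When you run MPI jobs with Open MPI images, you can use the following environment variables for each process launched:

OMPI_COMM_WORLD_RANK: The rank of the process
OMPI_COMM_WORLD_SIZE: The world size
AZ_BATCH_MASTER_NODE: The primary address with port, MASTER_ADDR:MASTER_PORT
OMPI_COMM_WORLD_LOCAL_RANK: The local rank of the process on the node
OMPI_COMM_WORLD_LOCAL_SIZE: The number of processes on the node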

Tip

Despite the name, the environment variable OMPI_COMM_WORLD_NODE_RANK doesn't correspond to NODE_RANK. To use a per-node launcher, set process_count_per_instance=1 and use OMPI_COMM_WORLD_RANK as the NODE_RANK.
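
As a minimal sketch of how a training script might read the variables listed above, the helper below collects them into a dictionary. The function name `read_open_mpi_env` and the fallback defaults (rank 0, size 1) for runs outside an MPI job are assumptions made for illustration; they are not part of Azure Machine Learning or Open MPI.

```python
import os
from typing import Mapping, Optional


def read_open_mpi_env(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect the Open MPI variables into a plain dict (hypothetical helper)."""
    env = os.environ if env is None else env
    return {
        # The defaults below are assumptions so the script also runs as a single local process.
        "rank": int(env.get("OMPI_COMM_WORLD_RANK", 0)),
        "world_size": int(env.get("OMPI_COMM_WORLD_SIZE", 1)),
        "local_rank": int(env.get("OMPI_COMM_WORLD_LOCAL_RANK", 0)),
        "local_size": int(env.get("OMPI_COMM_WORLD_LOCAL_SIZE", 1)),
        # Formatted as MASTER_ADDR:MASTER_PORT; None when the variable is not set.
        "master_node": env.get("AZ_BATCH_MASTER_NODE"),
    }
```

Passing a mapping instead of relying on `os.environ` keeps the helper easy to exercise locally, for example `read_open_mpi_env({"OMPI_COMM_WORLD_RANK": "1"})`.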

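PyTorch

Azure Machine Learning supports running distributed jobs using PyTorch's native distributed training capabilities (torch.distributed).

Tip

For data parallelism, the official PyTorch guidance is to use DistributedDataParallel (DDP) over DataParallel for both single-node and multi-node distributed training. PyTorch also recommends using DistributedDataParallel over the multiprocessing package. Azure Machine Learning documentation and examples therefore focus on DistributedDataParallel training.

Process group initialization

The backbone of any distributed training is a group of processes that know each other and can communicate with each other using a backend. For PyTorch, the process group is created by calling torch.distributed.init_process_group in all distributed processes to collectively form a process group.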

torch.distributed.init_process_group(backend='nccl', init_method='env://', ...)

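The most common communication backends are mpi, nccl, and gloo. For GPU-based training, nccl is recommended for best performance and should be used whenever possible.

init_method tells each process how to discover the others, and how to initialize and verify the process group using the communication backend. By default, if init_method isn't specified, PyTorch uses the environment variable initialization method (env://). env:// is the recommended initialization method to use in your training code to run distributed PyTorch on Azure Machine Learning. PyTorch looks for the following environment variables for initialization:

MASTER_ADDR: IP address of the machine that hosts the process with rank 0
MASTER_PORT: A free port on the machine that hosts the process with rank 0
WORLD_SIZE: The total number of processes. Should be equal to the total number of devices (GPUs) used for distributed training
RANK: The (global) rank of the current process. The possible values are 0 to (world size - 1)

For more information on process group initialization, see the PyTorch documentation.

Many applications also need the following environment variables:

LOCAL_RANK: The local (relative) rank of the process within the node. The possible values are 0 to (number of processes on the node - 1). This information is useful because many operations, such as data preparation, should be performed only once per node, usually on local_rank = 0.
NODE_RANK: The rank of the node for multi-node training. The possible values are 0 to (total number of nodes - 1).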
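
The following sketch shows one way a training script could consume these variables. `DistEnv`, `read_dist_env`, and `init_process_group_from_env` are hypothetical names introduced for illustration, and the fallback of 0 for a missing LOCAL_RANK is an assumption. The only PyTorch calls used are `torch.distributed.init_process_group` and `torch.cuda.set_device`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class DistEnv:
    master_addr: str
    master_port: int
    world_size: int
    rank: int
    local_rank: int


def read_dist_env(env: Optional[Mapping[str, str]] = None) -> DistEnv:
    """Parse the variables that the env:// initialization method relies on."""
    env = os.environ if env is None else env
    return DistEnv(
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
        world_size=int(env["WORLD_SIZE"]),
        rank=int(env["RANK"]),
        # Assumption: treat a missing LOCAL_RANK as 0 (single process per node).
        local_rank=int(env.get("LOCAL_RANK", 0)),
    )


def init_process_group_from_env(backend: str = "nccl") -> DistEnv:
    """Initialize torch.distributed from the environment and return the parsed values."""
    # Imported lazily so read_dist_env stays usable where PyTorch isn't installed.
    import torch
    import torch.distributed as dist

    info = read_dist_env()
    if backend == "nccl":
        # Bind this process to the GPU that matches its local rank.
        torch.cuda.set_device(info.local_rank)
    dist.init_process_group(backend=backend, init_method="env://")
    return info
```

A script would typically call `init_process_group_from_env()` once at startup and use `info.local_rank == 0` to guard work that should run once per node.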

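You don't need to use a launcher utility such as torch.distributed.launch. To run a distributed PyTorch job:

Specify the training script and arguments.
Create a command, specify the type as PyTorch, and set process_count_per_instance in the distribution parameter. process_count_per_instance is the number of processes to run on each node, so the total number of processes for the job is instance_count multiplied by process_count_per_instance. process_count_per_instance should typically equal the number of GPUs per node. If process_count_per_instance isn't specified, Azure Machine Learning launches one process per node by default.

Azure Machine Learning sets the MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and NODE_RANK environment variables on each node, and sets the process-level RANK and LOCAL_RANK environment variables.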

from azure.ai.ml import Input, command
from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.entities import ResourceConfiguration

# === Note on path ===
# The path can be a local path or a cloud path. Azure Machine Learning supports
# `https://`, `abfss://`, `wasbs://`, and `azureml://` URIs.
# Local paths are automatically uploaded to the default datastore in the cloud.
# More details on supported paths: https://docs.microsoft.com/azure/machine-learning/how-to-read-write-data-v2#supported-paths

# `returned_job` is not defined in this snippet; it is assumed to be an earlier
# job whose `cifar` output holds the dataset.
inputs = {
    "cifar": Input(
        type=AssetTypes.URI_FOLDER,
        path=returned_job.outputs.cifar.path,
        # Alternative paths:
        # path="azureml:azureml_stoic_cartoon_wgb3lgvgky_output_data_cifar:1"
        # path="azureml://datastores/workspaceblobstore/paths/azureml/stoic_cartoon_wgb3lgvgky/cifar/"
    ),
    "epoch": 10,
    "batchsize": 64,
    "workers": 2,
    "lr": 0.01,
    "momen": 0.9,
    "prtfreq": 200,
    "output": "./outputs",
}

job = command(
    code="./src",  # local path where the code is stored
    command="python train.py --data-dir ${{inputs.cifar}} --epochs ${{inputs.epoch}} --batch-size ${{inputs.batchsize}} --workers ${{inputs.workers}} --learning-rate ${{inputs.lr}} --momentum ${{inputs.momen}} --print-freq ${{inputs.prtfreq}} --model-dir ${{inputs.output}}",
    inputs=inputs,
    environment="azureml:AzureML-acpt-pytorch-2.2-cuda12.1@latest",
    instance_count=2,  # In this example, only a 2-node cluster was created.
    distribution={
        "type": "PyTorch",
        # set process count to the number of GPUs per node
        # NC6s_v3 has only 1 GPU
        "process_count_per_instance": 1,
    },
)
job.resources = ResourceConfiguration(
    instance_type="Standard_NC6s_v3", instance_count=2
)  # Serverless compute resources
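
To make the relationship between the job settings and the environment variables concrete, the sketch below enumerates the processes that a job with a given instance_count and process_count_per_instance would start. The function `rank_layout` is hypothetical, and the formula `rank = node_rank * process_count_per_instance + local_rank` is the conventional node-by-node assignment, assumed here for illustration rather than taken from the Azure Machine Learning documentation.

```python
from typing import List, Tuple


def rank_layout(
    instance_count: int, process_count_per_instance: int
) -> List[Tuple[int, int, int]]:
    """Return (node_rank, local_rank, rank) for every process in the job.

    Assumes global ranks are assigned node by node, which is the usual convention.
    """
    return [
        (node_rank, local_rank, node_rank * process_count_per_instance + local_rank)
        for node_rank in range(instance_count)
        for local_rank in range(process_count_per_instance)
    ]


# The job above: 2 nodes with 1 process each, so WORLD_SIZE is 2.
layout = rank_layout(instance_count=2, process_count_per_instance=1)
world_size = len(layout)
```

With 2 nodes of 4 GPUs each, the same function would list 8 processes, with local ranks 0 to 3 repeating on each node.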

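PyTorch example

For the full notebook to run the PyTorch example, see azureml-examples: Distributed training with PyTorch on CIFAR-10.

DeepSpeed

Azure Machine Learning supports DeepSpeed as a first-class citizen to run distributed jobs with near-linear scalability in terms of:

Increase in model size
Increase in number of GPUs

You can enable DeepSpeed for distributed training by using either the PyTorch distribution or MPI. Azure Machine Learning supports the DeepSpeed launcher to launch distributed training, as well as autotuning to get the optimal ds configuration.

You can use a curated environment for an out-of-the-box environment with the latest technologies, including DeepSpeed, ORT, MSSCCL, and PyTorch, for your DeepSpeed training jobs.

DeepSpeed example

For DeepSpeed training and autotuning examples, see these folders.

TensorFlow

If you use native distributed TensorFlow in your training code, such as the TensorFlow 2.x tf.distribute.Strategy API, you can launch the distributed job through Azure Machine Learning by using the distribution parameter or the TensorFlowDistribution object.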

from azure.ai.ml import TensorFlowDistribution, command

# create the command
job = command(
    code="./src",  # local path where the code is stored
    command="python main.py --epochs ${{inputs.epochs}} --model-dir ${{inputs.model_dir}}",
    inputs={"epochs": 1, "model_dir": "outputs/keras-model"},
    environment="AzureML-tensorflow-2.12-cuda11@latest",
    compute="cpu-cluster",
    instance_count=2,
    # distribution = {"type": "mpi", "process_count_per_instance": 1},
    # distribution={
    #     "type": "tensorflow",
    #     "parameter_server_count": 1,  # for legacy TensorFlow 1.x
    #     "worker_count": 2,
    #     "added_property": 7,
    # },
    # distribution = {
    #        "type": "pytorch",
    #        "process_count_per_instance": 4,
    #        "additional_prop": {"nested_prop": 3},
    #    },
    display_name="tensorflow-mnist-distributed-example"
    # experiment_name: tensorflow-mnist-distributed-example
    # description: Train a basic neural network with TensorFlow on the MNIST dataset, distributed via TensorFlow.
)

# can also set the distribution in a separate step and using the typed objects instead of a dict
job.distribution = TensorFlowDistribution(worker_count=2)

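If your training script uses the parameter server strategy for distributed training, such as for legacy TensorFlow 1.x, you also need to specify the number of parameter servers to use in the job, inside the distribution parameter of the command. In the example above, for instance, "parameter_server_count": 1 and "worker_count": 2.

TF_CONFIG

In TensorFlow, the TF_CONFIG environment variable is required for training on multiple machines. For TensorFlow jobs, Azure Machine Learning sets the TF_CONFIG variable appropriately for each worker before it runs your training script.

You can access TF_CONFIG from your training script if you need to: os.environ['TF_CONFIG'].

Example TF_CONFIG set on a worker node: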

TF_CONFIG='{
    "cluster": {
        "worker": ["host0:2222", "host1:2222"]
    },
    "task": {"type": "worker", "index": 0},
    "environment": "cloud"
}'
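
As a small illustration of reading this variable from a training script, the sketch below parses TF_CONFIG with the standard library and extracts the worker count and the current task. `parse_tf_config` is a hypothetical helper, and returning empty values when TF_CONFIG is not set is an assumption that lets the script also run locally.

```python
import json
import os
from typing import Optional


def parse_tf_config(raw: Optional[str] = None) -> dict:
    """Extract cluster size and task identity from a TF_CONFIG JSON string."""
    if raw is None:
        # Assumption: an unset TF_CONFIG means a single-machine run.
        raw = os.environ.get("TF_CONFIG", "{}")
    config = json.loads(raw)
    workers = config.get("cluster", {}).get("worker", [])
    task = config.get("task", {})
    return {
        "num_workers": len(workers),
        "task_type": task.get("type"),
        "task_index": task.get("index"),
    }


example = '{"cluster": {"worker": ["host0:2222", "host1:2222"]}, "task": {"type": "worker", "index": 0}, "environment": "cloud"}'
summary = parse_tf_config(example)
```

A script might use `task_index` to decide which worker writes checkpoints or logs, so that only one process performs that work.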

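TensorFlow example

For the full notebook to run the TensorFlow example, see azureml-examples: Train a basic neural network with distributed MPI on the MNIST dataset using TensorFlow with Horovod.

Accelerate distributed GPU training with InfiniBand

As the number of VMs training a model increases, the time required to train that model should decrease. Ideally, the training time should be inversely proportional to the number of training VMs. For instance, if training a model on one VM takes 100 seconds, then training the same model on two VMs should take 50 seconds, training it on four VMs should take 25 seconds, and so on.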
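
The ideal case described above can be written as a short calculation. The sketch below computes the ideal training time for a given number of VMs and compares it with a measured time to give a scaling efficiency. Both function names are hypothetical, and the measured value in the example is made up purely to show the arithmetic.

```python
def ideal_training_time(single_vm_seconds: float, vm_count: int) -> float:
    """Training time under perfect linear scaling."""
    return single_vm_seconds / vm_count


def scaling_efficiency(
    single_vm_seconds: float, vm_count: int, measured_seconds: float
) -> float:
    """Ratio of ideal to measured time; 1.0 means perfectly linear scaling."""
    return ideal_training_time(single_vm_seconds, vm_count) / measured_seconds


# 100 seconds on one VM gives 50 seconds on two VMs and 25 seconds on four.
two_vm_ideal = ideal_training_time(100, 2)
four_vm_ideal = ideal_training_time(100, 4)

# Illustrative only: if four VMs actually took 40 seconds, efficiency is 25 / 40.
example_efficiency = scaling_efficiency(100, 4, measured_seconds=40)
```

Communication overhead between nodes is what usually pushes the efficiency below 1.0, which is where the interconnect matters.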

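InfiniBand can be an important factor in attaining this linear scaling. InfiniBand enables low-latency, GPU-to-GPU communication across nodes in a cluster. InfiniBand requires specialized hardware to operate. Certain Azure VM series, specifically the NC, ND, and H-series, now have RDMA-capable VMs with SR-IOV and InfiniBand support. These VMs communicate over the low-latency, high-bandwidth InfiniBand network, which is much more performant than Ethernet-based connectivity. SR-IOV for InfiniBand enables near bare-metal performance for any MPI library (MPI is used by many distributed training frameworks and tooling, including NVIDIA's NCCL software). These SKUs are intended to meet the needs of computationally intensive, GPU-accelerated machine learning workloads. For more information, see Accelerating Distributed Training in Azure Machine Learning with SR-IOV.

Typically, VM SKUs with an "r" in their name contain the required InfiniBand hardware, and those without an "r" typically don't. (The "r" is a reference to RDMA, which stands for remote direct memory access.) For instance, the VM SKU Standard_NC24rs_v3 is InfiniBand-enabled, but the SKU Standard_NC24s_v3 isn't. Aside from the InfiniBand capabilities, the specs of these two SKUs are largely the same. Both have 24 cores, 448 GB of RAM, 4 GPUs of the same SKU, and so on. Learn more about RDMA- and InfiniBand-enabled machine SKUs.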

Warning

The older-generation machine SKU Standard_NC24r is RDMA-enabled, but it doesn't contain the SR-IOV hardware required for InfiniBand.
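
The naming rule of thumb and its exception can be sketched as a quick check. This is a heuristic on the SKU name only, not a query against Azure, and the regular expression is an assumption about how size names are laid out (`Standard_`, family letters, a number, then lowercase feature letters). `name_suggests_infiniband` and `KNOWN_EXCEPTIONS` are hypothetical names; always confirm against the published SKU documentation.

```python
import re

# Assumed name layout: Standard_<FAMILY><number><lowercase feature letters>[_suffix]
_SKU_PATTERN = re.compile(r"^Standard_([A-Z]+)(\d+)([a-z]*)")

# From the warning above: RDMA-enabled, but without the SR-IOV hardware InfiniBand needs.
KNOWN_EXCEPTIONS = {"Standard_NC24r"}


def name_suggests_infiniband(sku: str) -> bool:
    """Heuristic: True when the feature letters include 'r' and the SKU is not a known exception."""
    if sku in KNOWN_EXCEPTIONS:
        return False
    match = _SKU_PATTERN.match(sku)
    return bool(match) and "r" in match.group(3)
```

For the two SKUs compared earlier, the check separates Standard_NC24rs_v3 from Standard_NC24s_v3 based on the "r" in the feature letters.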

If you create an AmlCompute cluster of one of these RDMA-capable, InfiniBand-enabled sizes, the OS image comes with the Mellanox OFED driver required to enable InfiniBand preinstalled and preconfigured.
