Distributed GPU training guide (SDK v1)

Article
03/25/2024

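Learn how to use distributed GPU training code in Azure Machine Learning (ML). This article does not teach distributed training itself. It helps you run your existing distributed training code on Azure Machine Learning, and offers tips and examples to follow for each framework:

Message Passing Interface (MPI)
- Horovod
- DeepSpeed
- Environment variables from Open MPI
PyTorch
- Process group initialization
- Launch options
- DistributedDataParallel (per-process-launch)
- Using torch.distributed.launch (per-node-launch)
- PyTorch Lightning
- Hugging Face Transformers
TensorFlow
- Environment variables for TensorFlow (TF_CONFIG)
Accelerate GPU training with InfiniBand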

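Prerequisites

Review these basic concepts of distributed GPU training: data parallelism, distributed data parallelism, and model parallelism.

Tip

If you don't know which type of parallelism to use, more than 90% of the time you should use distributed data parallelism.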

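MPI

Azure Machine Learning offers an MPI job to launch a given number of processes on each node. You can use this approach to run distributed training with either a per-process launcher or a per-node launcher. Set process_count_per_node to 1 (the default) for a per-node launcher, or to the number of devices/GPUs for a per-process launcher. Azure Machine Learning constructs the full MPI launch command (mpirun) behind the scenes. You can't provide your own full head-node launcher command, such as mpirun or the DeepSpeed launcher.

Tip

The base Docker image used by an Azure Machine Learning MPI job must have an MPI library installed. Open MPI is included in all the Azure Machine Learning GPU base images. When you use a custom Docker image, you are responsible for making sure the image includes an MPI library. Open MPI is recommended, but you can also use a different MPI implementation such as Intel MPI. Azure Machine Learning also provides curated environments for popular frameworks.

To run distributed training using MPI, follow these steps:

Use an Azure Machine Learning environment with the preferred deep learning framework and MPI. Azure Machine Learning provides curated environments for popular frameworks.
Define an MpiConfiguration with process_count_per_node and node_count. For per-process launch, process_count_per_node should equal the number of GPUs per node. For per-node launch, where the user script is responsible for launching the processes on each node, set it to 1 (the default).
Pass the MpiConfiguration object to the distributed_job_config parameter of ScriptRunConfig.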

from azureml.core import Workspace, ScriptRunConfig, Environment, Experiment
from azureml.core.runconfig import MpiConfiguration

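# Load the workspace from a local config.json; later snippets reuse ws and compute_target.
ws = Workspace.from_config()
# 'gpu-cluster' is a placeholder; use the name of your own compute cluster.
compute_target = ws.compute_targets['gpu-cluster']

curated_env_name = 'AzureML-PyTorch-1.6-GPU'
pytorch_env = Environment.get(workspace=ws, name=curated_env_name)
distr_config = MpiConfiguration(process_count_per_node=4, node_count=2)

run_config = ScriptRunConfig(
  source_directory='./src',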
  script='train.py',
  compute_target=compute_target,
  environment=pytorch_env,
  distributed_job_config=distr_config,
)

# submit the run configuration to start the job
run = Experiment(ws, "experiment_name").submit(run_config)

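Horovod

Use the MPI job configuration when you use Horovod for distributed training with a deep learning framework.

Make sure your code follows these tips:

The training code is instrumented correctly with Horovod before you add the Azure Machine Learning parts.
Your Azure Machine Learning environment contains Horovod and MPI. The PyTorch and TensorFlow curated GPU environments come preconfigured with Horovod and its dependencies.
Create an MpiConfiguration with your desired distribution.

Horovod example

azureml-examples: TensorFlow distributed training using Horovod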

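DeepSpeed

Don't use DeepSpeed's custom launcher to run distributed training with the DeepSpeed library on Azure Machine Learning. Instead, configure an MPI job to launch the training job with MPI.

Make sure your code follows these tips:

Your Azure Machine Learning environment contains DeepSpeed and its dependencies, Open MPI, and mpi4py.
Create an MpiConfiguration with your distribution.

DeepSpeed example

azureml-examples: Distributed training with DeepSpeed on CIFAR-10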

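Environment variables from Open MPI

When you run MPI jobs with Open MPI images, the following environment variables are set for each process launched:

OMPI_COMM_WORLD_RANK - the rank of the process
OMPI_COMM_WORLD_SIZE - the world size
AZ_BATCH_MASTER_NODE - the primary address with port, in the form MASTER_ADDR:MASTER_PORT
OMPI_COMM_WORLD_LOCAL_RANK - the local rank of the process on the node
OMPI_COMM_WORLD_LOCAL_SIZE - the number of processes on the node

Tip

Despite the name, the environment variable OMPI_COMM_WORLD_NODE_RANK does not correspond to NODE_RANK. To use a per-node launcher, set process_count_per_node=1 and use OMPI_COMM_WORLD_RANK as the NODE_RANK.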
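
As a rough illustration of how a training script might consume these variables, the sketch below reads them from an environment mapping and splits AZ_BATCH_MASTER_NODE into its address and port. The names MpiProcessInfo and read_open_mpi_env are hypothetical helpers introduced here for illustration; they are not part of any Azure Machine Learning API.

```python
import os
from typing import Mapping, NamedTuple


class MpiProcessInfo(NamedTuple):
    """Hypothetical container for the per-process values listed above."""
    rank: int
    world_size: int
    local_rank: int
    local_size: int
    master_addr: str
    master_port: str  # empty string if the variable carries no port


def read_open_mpi_env(env: Mapping[str, str] = os.environ) -> MpiProcessInfo:
    """Collect the Open MPI variables from an environment mapping."""
    # AZ_BATCH_MASTER_NODE has the form "<address>:<port>".
    master_addr, _, master_port = env["AZ_BATCH_MASTER_NODE"].partition(":")
    return MpiProcessInfo(
        rank=int(env["OMPI_COMM_WORLD_RANK"]),
        world_size=int(env["OMPI_COMM_WORLD_SIZE"]),
        local_rank=int(env["OMPI_COMM_WORLD_LOCAL_RANK"]),
        local_size=int(env["OMPI_COMM_WORLD_LOCAL_SIZE"]),
        master_addr=master_addr,
        master_port=master_port,
    )
```

Taking the mapping as a parameter, rather than reading os.environ directly, keeps the helper easy to exercise outside a cluster with a plain dictionary.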

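PyTorch

Azure Machine Learning supports running distributed jobs using PyTorch's native distributed training capabilities (torch.distributed).

Tip

For data parallelism, the official PyTorch guidance is to use DistributedDataParallel (DDP) over DataParallel for both single-node and multi-node distributed training. PyTorch also recommends using DistributedDataParallel over the multiprocessing package. Azure Machine Learning documentation and examples therefore focus on DistributedDataParallel training.

Process group initialization

The backbone of any distributed training is a group of processes that know each other and can communicate with each other using a backend. For PyTorch, the process group is created by calling torch.distributed.init_process_group in all distributed processes, which collectively form the process group.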

torch.distributed.init_process_group(backend='nccl', init_method='env://', ...)

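The most common communication backends are mpi, nccl, and gloo. For GPU-based training, nccl is recommended for best performance and should be used whenever possible.

init_method tells each process how to discover the others, and how to initialize and verify the process group using the communication backend. By default, if init_method is not specified, PyTorch uses the environment variable initialization method (env://). env:// is the recommended initialization method to use in your training code to run distributed PyTorch on Azure Machine Learning. PyTorch looks for the following environment variables for initialization:

MASTER_ADDR - IP address of the machine that hosts the process with rank 0.
MASTER_PORT - A free port on the machine that hosts the process with rank 0.
WORLD_SIZE - The total number of processes. Should be equal to the total number of devices (GPUs) used for distributed training.
RANK - The (global) rank of the current process. The possible values are 0 to (world size - 1).

For more information on process group initialization, see the PyTorch documentation.

Beyond these, many applications also need the following environment variables:

LOCAL_RANK - The local (relative) rank of the process within the node. The possible values are 0 to (number of processes on the node - 1). This information is useful because many operations, such as data preparation, should only be performed once per node, usually on local_rank = 0.
NODE_RANK - The rank of the node for multi-node training. The possible values are 0 to (total number of nodes - 1).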
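
Before calling init_process_group with env://, it can help to confirm that the variables above are present and consistent. The following is a minimal sketch of such a check using only the standard library; the function name read_env_init_settings and the shape of the returned dictionary are assumptions made for illustration and are not part of PyTorch.

```python
import os
from typing import Mapping

# Variables that the env:// initialization method reads.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")


def read_env_init_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Validate and collect the env:// variables (hypothetical helper)."""
    missing = [name for name in REQUIRED_VARS if name not in env]
    if missing:
        raise KeyError("missing environment variables: " + ", ".join(missing))

    world_size = int(env["WORLD_SIZE"])
    rank = int(env["RANK"])
    # RANK must fall in the range 0 to (world size - 1).
    if not 0 <= rank < world_size:
        raise ValueError(f"RANK {rank} is outside 0..{world_size - 1}")

    # LOCAL_RANK and NODE_RANK are optional; report None when they are absent.
    local_rank = env.get("LOCAL_RANK")
    node_rank = env.get("NODE_RANK")
    return {
        "master_addr": env["MASTER_ADDR"],
        "master_port": int(env["MASTER_PORT"]),
        "world_size": world_size,
        "rank": rank,
        "local_rank": None if local_rank is None else int(local_rank),
        "node_rank": None if node_rank is None else int(node_rank),
    }
```

A training script could call this helper first and then call torch.distributed.init_process_group(backend='nccl', init_method='env://'), so that a misconfigured job fails with a readable error instead of hanging during rendezvous.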

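PyTorch launch options

The Azure Machine Learning PyTorch job supports two types of options for launching distributed training:

Per-process launcher: The system launches all distributed processes for you, with all the relevant information (such as environment variables) to set up the process group.
Per-node launcher: You provide Azure Machine Learning with the utility launcher that runs on each node. The utility launcher handles launching each of the processes on a given node. Locally within each node, RANK and LOCAL_RANK are set up by the launcher. The torch.distributed.launch utility and PyTorch Lightning both belong in this category.

There are no fundamental differences between these launch options. The choice is largely up to your preference or the conventions of the frameworks/libraries built on top of vanilla PyTorch (such as Lightning or Hugging Face).

The following sections go into more detail on how to configure Azure Machine Learning PyTorch jobs for each of the launch options.

DistributedDataParallel (per-process-launch)

You don't need to use a launcher utility like torch.distributed.launch. To run a distributed PyTorch job:

Specify the training script and arguments.
Create a PyTorchConfiguration and specify process_count and node_count. process_count corresponds to the total number of processes you want to run for your job. It should typically equal (number of GPUs per node) x (number of nodes). If process_count isn't specified, Azure Machine Learning launches one process per node by default.

Azure Machine Learning sets the MASTER_ADDR, MASTER_PORT, WORLD_SIZE, and NODE_RANK environment variables on each node, and sets the process-level RANK and LOCAL_RANK environment variables.

To use this option for multi-process-per-node training, use Azure Machine Learning Python SDK >= 1.22.0. process_count was introduced in 1.22.0.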

from azureml.core import ScriptRunConfig, Environment, Experiment
from azureml.core.runconfig import PyTorchConfiguration

curated_env_name = 'AzureML-PyTorch-1.6-GPU'
pytorch_env = Environment.get(workspace=ws, name=curated_env_name)
distr_config = PyTorchConfiguration(process_count=8, node_count=2)

run_config = ScriptRunConfig(
  source_directory='./src',
  script='train.py',
  arguments=['--epochs', 50],
  compute_target=compute_target,
  environment=pytorch_env,
  distributed_job_config=distr_config,
)

run = Experiment(ws, 'experiment_name').submit(run_config)
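
The arithmetic behind process_count and the per-process ranks can be sketched as follows. The example assumes that global ranks are assigned node by node in contiguous blocks, which is a common layout but is an assumption here rather than something this article guarantees; both function names are hypothetical.

```python
from typing import List, Tuple


def total_process_count(gpus_per_node: int, node_count: int) -> int:
    """process_count = (GPUs per node) x (nodes)."""
    return gpus_per_node * node_count


def rank_layout(gpus_per_node: int, node_count: int) -> List[Tuple[int, int, int]]:
    """Return (rank, node_rank, local_rank) triples.

    Assumes ranks are assigned contiguously, one node after another.
    """
    return [
        (node_rank * gpus_per_node + local_rank, node_rank, local_rank)
        for node_rank in range(node_count)
        for local_rank in range(gpus_per_node)
    ]
```

With 4 GPUs per node and 2 nodes, total_process_count returns 8, which matches the process_count=8, node_count=2 configuration in the example above.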

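Tip

If your training script passes information like local rank or rank as script arguments, you can reference the environment variables in the arguments:

arguments=['--epochs', 50, '--local_rank', '$LOCAL_RANK']

PyTorch per-process-launch example

azureml-examples: Distributed training with PyTorch on CIFAR-10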

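Using torch.distributed.launch (per-node-launch)

PyTorch provides a launch utility in torch.distributed.launch that you can use to launch multiple processes per node. The torch.distributed.launch module spawns multiple training processes on each of the nodes.

The following steps demonstrate how to configure a PyTorch job with a per-node launcher on Azure Machine Learning. The job achieves the equivalent of running the following command: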

python -m torch.distributed.launch --nproc_per_node <num processes per node> \
  --nnodes <num nodes> --node_rank $NODE_RANK --master_addr $MASTER_ADDR \
  --master_port $MASTER_PORT --use_env \
  <your training script> <your script arguments>

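Provide the torch.distributed.launch command to the command parameter of the ScriptRunConfig constructor. Azure Machine Learning runs this command on each node of your training cluster. --nproc_per_node should be less than or equal to the number of GPUs available on each node. MASTER_ADDR, MASTER_PORT, and NODE_RANK are all set by Azure Machine Learning, so you can reference the environment variables directly in the command. Azure Machine Learning sets MASTER_PORT to 6105, but you can pass a different value to the --master_port argument of the torch.distributed.launch command if you wish. (The launch utility resets the environment variables.)
Create a PyTorchConfiguration and specify the node_count.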

from azureml.core import ScriptRunConfig, Environment, Experiment
from azureml.core.runconfig import PyTorchConfiguration

curated_env_name = 'AzureML-PyTorch-1.6-GPU'
pytorch_env = Environment.get(workspace=ws, name=curated_env_name)
distr_config = PyTorchConfiguration(node_count=2)
launch_cmd = "python -m torch.distributed.launch --nproc_per_node 4 --nnodes 2 --node_rank $NODE_RANK --master_addr $MASTER_ADDR --master_port $MASTER_PORT --use_env train.py --epochs 50".split()

run_config = ScriptRunConfig(
  source_directory='./src',
  command=launch_cmd,
  compute_target=compute_target,
  environment=pytorch_env,
  distributed_job_config=distr_config,
)

run = Experiment(ws, 'experiment_name').submit(run_config)

Tip

Single-node multi-GPU training: If you use the launch utility to run single-node multi-GPU PyTorch training, you don't need to specify the distributed_job_config parameter of ScriptRunConfig.

launch_cmd = "python -m torch.distributed.launch --nproc_per_node 4 --use_env train.py --epochs 50".split()

run_config = ScriptRunConfig(
  source_directory='./src',
  command=launch_cmd,
  compute_target=compute_target,
  environment=pytorch_env,
)

PyTorch per-node-launch example

azureml-examples: Distributed training with PyTorch on CIFAR-10

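PyTorch Lightning

PyTorch Lightning is a lightweight open-source library that provides a high-level interface for PyTorch. Lightning abstracts away many of the lower-level distributed training configurations required for vanilla PyTorch. Lightning lets you run your training scripts in single-GPU, single-node multi-GPU, and multi-node multi-GPU settings. Behind the scenes, it launches multiple processes for you, similar to torch.distributed.launch.

For single-node training (including single-node multi-GPU), you can run your code on Azure Machine Learning without needing to specify a distributed_job_config. To run an experiment using multiple nodes with multiple GPUs, there are two options:

Using PyTorch configuration (recommended): Define PyTorchConfiguration and specify communication_backend="Nccl", node_count, and process_count (note that this is the total number of processes, that is, num_nodes * process_count_per_node). In the Lightning Trainer module, specify both num_nodes and gpus to be consistent with PyTorchConfiguration. For example, num_nodes = node_count and gpus = process_count_per_node.

Using MPI configuration:

Define MpiConfiguration and specify both node_count and process_count_per_node. In the Lightning Trainer, specify both num_nodes and gpus to be the same as node_count and process_count_per_node from MpiConfiguration, respectively.
For multi-node training with MPI, Lightning requires the following environment variables to be set on each node of your training cluster:
- MASTER_ADDR
- MASTER_PORT
- NODE_RANK
- LOCAL_RANK
Manually set these environment variables that Lightning requires in the main training script: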

import os
from argparse import ArgumentParser

from pytorch_lightning import Trainer


def set_environment_variables_for_mpi(num_nodes, gpus_per_node, master_port=54965):
    # Derive the rendezvous address and port that Lightning expects.
    if num_nodes > 1:
        # AZ_BATCH_MASTER_NODE has the form "<address>:<port>".
        os.environ["MASTER_ADDR"], os.environ["MASTER_PORT"] = os.environ["AZ_BATCH_MASTER_NODE"].split(":")
    else:
        os.environ["MASTER_ADDR"] = os.environ["AZ_BATCHAI_MPI_MASTER_NODE"]
        os.environ["MASTER_PORT"] = str(master_port)

    try:
        # The global MPI rank divided by the processes per node gives the node rank.
        os.environ["NODE_RANK"] = str(int(os.environ.get("OMPI_COMM_WORLD_RANK")) // gpus_per_node)
        # additional variables
        os.environ["MASTER_ADDRESS"] = os.environ["MASTER_ADDR"]
        os.environ["LOCAL_RANK"] = os.environ["OMPI_COMM_WORLD_LOCAL_RANK"]
        os.environ["WORLD_SIZE"] = os.environ["OMPI_COMM_WORLD_SIZE"]
    except (KeyError, TypeError):
        # The Open MPI variables are absent when a PyTorch configuration is used instead of MPI.
        pass


if __name__ == "__main__":
    parser = ArgumentParser()
    parser.add_argument("--num_nodes", type=int, required=True)
    parser.add_argument("--gpus_per_node", type=int, required=True)
    args = parser.parse_args()
    set_environment_variables_for_mpi(args.num_nodes, args.gpus_per_node)

    trainer = Trainer(
        num_nodes=args.num_nodes,
        gpus=args.gpus_per_node,
    )

Lightning computes the world size from the Trainer flags --gpus and --num_nodes.

from azureml.core import ScriptRunConfig, Experiment
from azureml.core.runconfig import MpiConfiguration

nnodes = 2
gpus_per_node = 4
args = ['--max_epochs', 50, '--gpus_per_node', gpus_per_node, '--accelerator', 'ddp', '--num_nodes', nnodes]
distr_config = MpiConfiguration(node_count=nnodes, process_count_per_node=gpus_per_node)

run_config = ScriptRunConfig(
   source_directory='./src',
   script='train.py',
   arguments=args,
   compute_target=compute_target,
   environment=pytorch_env,
   distributed_job_config=distr_config,
)

run = Experiment(ws, 'experiment_name').submit(run_config)

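Hugging Face Transformers

Hugging Face provides many examples for using its Transformers library with torch.distributed.launch to run distributed training. To run these examples and your own custom training scripts using the Transformers Trainer API, follow the Using torch.distributed.launch section.

Sample job configuration code to fine-tune the BERT large model on the text classification MNLI task using the run_glue.py script on one node with 8 GPUs: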

from azureml.core import ScriptRunConfig
from azureml.core.runconfig import PyTorchConfiguration

distr_config = PyTorchConfiguration() # node_count defaults to 1
launch_cmd = "python -m torch.distributed.launch --nproc_per_node 8 text-classification/run_glue.py --model_name_or_path bert-large-uncased-whole-word-masking --task_name mnli --do_train --do_eval --max_seq_length 128 --per_device_train_batch_size 8 --learning_rate 2e-5 --num_train_epochs 3.0 --output_dir /tmp/mnli_output".split()

run_config = ScriptRunConfig(
  source_directory='./src',
  command=launch_cmd,
  compute_target=compute_target,
  environment=pytorch_env,
  distributed_job_config=distr_config,
)

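You can also use the per-process-launch option to run distributed training without using torch.distributed.launch. One thing to keep in mind with this method is that the Transformers TrainingArguments expect the local rank to be passed in as an argument (--local_rank). torch.distributed.launch takes care of this when --use_env=False, but if you use per-process launch, you need to explicitly pass the local rank as an argument to the training script, --local_rank=$LOCAL_RANK, because Azure Machine Learning only sets the LOCAL_RANK environment variable.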
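
For your own training scripts, one possible alternative to passing the argument explicitly is to let the script fall back to the LOCAL_RANK environment variable when --local_rank is not given. The sketch below shows that pattern with argparse. The parse_args helper is hypothetical, and the default of -1 follows the common convention for "not running distributed"; treat both as assumptions rather than as the behavior of run_glue.py.

```python
import argparse
import os
from typing import Mapping, Optional, Sequence


def parse_args(
    argv: Optional[Sequence[str]] = None,
    env: Mapping[str, str] = os.environ,
) -> argparse.Namespace:
    """Parse --local_rank, defaulting to the LOCAL_RANK environment variable."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--local_rank",
        type=int,
        # Fall back to LOCAL_RANK; -1 signals that no local rank is available.
        default=int(env.get("LOCAL_RANK", -1)),
    )
    return parser.parse_args(argv)
```

An explicit --local_rank on the command line still takes precedence, so the same script works with both launch options.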

TensorFlow

If you use native distributed TensorFlow in your training code, such as the TensorFlow 2.x tf.distribute.Strategy API, you can launch the distributed job through Azure Machine Learning using the TensorflowConfiguration.

To do so, specify a TensorflowConfiguration object as the distributed_job_config parameter of the ScriptRunConfig constructor. If you use tf.distribute.experimental.MultiWorkerMirroredStrategy, specify the worker_count in the TensorflowConfiguration to match the number of nodes for your training job.

from azureml.core import ScriptRunConfig, Environment, Experiment
from azureml.core.runconfig import TensorflowConfiguration

curated_env_name = 'AzureML-TensorFlow-2.3-GPU'
tf_env = Environment.get(workspace=ws, name=curated_env_name)
distr_config = TensorflowConfiguration(worker_count=2, parameter_server_count=0)

run_config = ScriptRunConfig(
  source_directory='./src',
  script='train.py',
  compute_target=compute_target,
  environment=tf_env,
  distributed_job_config=distr_config,
)

# submit the run configuration to start the job
run = Experiment(ws, "experiment_name").submit(run_config)

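If your training script uses the parameter server strategy for distributed training, such as for legacy TensorFlow 1.x, you also need to specify the number of parameter servers to use in the job, for example, tf_config = TensorflowConfiguration(worker_count=2, parameter_server_count=1).

TF_CONFIG

In TensorFlow, the TF_CONFIG environment variable is required for training on multiple machines. For TensorFlow jobs, Azure Machine Learning configures and sets the TF_CONFIG variable appropriately for each worker before executing your training script.

You can access TF_CONFIG from your training script if you need to: os.environ['TF_CONFIG'].

Example TF_CONFIG set on a worker node: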

TF_CONFIG='{
    "cluster": {
        "worker": ["host0:2222", "host1:2222"]
    },
    "task": {"type": "worker", "index": 0},
    "environment": "cloud"
}'
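
Because TF_CONFIG is a JSON string, a training script can decode it with the standard json module. The following is a minimal sketch that extracts the task type, the task index, and the number of workers; the helper name describe_tf_config is hypothetical, and the sketch assumes the structure shown in the example above.

```python
import json
import os
from typing import Mapping, Tuple


def describe_tf_config(env: Mapping[str, str] = os.environ) -> Tuple[str, int, int]:
    """Return (task type, task index, worker count) from TF_CONFIG."""
    config = json.loads(env["TF_CONFIG"])
    task = config["task"]
    # The "worker" list holds one "host:port" entry per worker.
    workers = config["cluster"].get("worker", [])
    return task["type"], task["index"], len(workers)
```

A common use is to restrict work such as checkpoint writing to the worker whose index is 0.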

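TensorFlow example

azureml-examples: Distributed TensorFlow training with MultiWorkerMirroredStrategy

Accelerating distributed GPU training with InfiniBand

As the number of VMs training a model increases, the time required to train that model should decrease. Ideally, the decrease in time is linearly proportional to the number of training VMs. For instance, if training a model on one VM takes 100 seconds, then training the same model on two VMs should take 50 seconds. Training the model on four VMs should take 25 seconds, and so on.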
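
The ideal linear scaling described above, and how far a measured run falls short of it, can be expressed in a few lines. This is an illustrative sketch only; the function names are hypothetical, and scaling efficiency is defined here as the ideal time divided by the measured time.

```python
def ideal_training_time(single_vm_seconds: float, vm_count: int) -> float:
    """Training time under perfect linear scaling."""
    return single_vm_seconds / vm_count


def scaling_efficiency(single_vm_seconds: float, vm_count: int, measured_seconds: float) -> float:
    """Ratio of ideal to measured time; 1.0 means perfect linear scaling."""
    return ideal_training_time(single_vm_seconds, vm_count) / measured_seconds
```

For the figures in the text, ideal_training_time(100, 2) is 50 seconds and ideal_training_time(100, 4) is 25 seconds. If the four-VM run actually took 40 seconds, the efficiency would be 25 / 40 = 0.625, and the gap is typically attributable to communication overhead between nodes.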

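InfiniBand can be an important factor in attaining this linear scaling. InfiniBand enables low-latency, GPU-to-GPU communication across nodes in a cluster. InfiniBand requires specialized hardware to operate. Certain Azure VM series, specifically the NC, ND, and H-series, now have RDMA-capable VMs with SR-IOV and InfiniBand support. These VMs communicate over the low-latency, high-bandwidth InfiniBand network, which performs much better than Ethernet-based connectivity. SR-IOV for InfiniBand enables near bare-metal performance for any MPI library. (MPI is used by many distributed training frameworks and tools, including NVIDIA's NCCL software.) These SKUs are intended to meet the needs of computationally intensive, GPU-accelerated machine learning workloads. For more information, see Accelerating Distributed Training in Azure Machine Learning with SR-IOV.

Typically, VM SKUs with an 'r' in their name contain the required InfiniBand hardware, and those without an 'r' typically do not. ('r' is a reference to RDMA, which stands for "remote direct memory access.") For instance, the VM SKU Standard_NC24rs_v3 is InfiniBand-enabled, but the SKU Standard_NC24s_v3 is not. Aside from the InfiniBand capabilities, the specs of these two SKUs are largely the same: both have 24 cores, 448 GB RAM, 4 GPUs of the same SKU, and so on. Learn more about RDMA- and InfiniBand-enabled machine SKUs.

Warning

The older-generation machine SKU Standard_NC24r is RDMA-enabled, but it does not contain the SR-IOV hardware required for InfiniBand.

If you create an AmlCompute cluster of one of these RDMA-capable, InfiniBand-enabled sizes, the OS image comes with the Mellanox OFED driver required to enable InfiniBand preinstalled and preconfigured.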
