Deploy container instances that use GPU resources

Article
08/29/2024

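To run certain compute-intensive workloads on Azure Container Instances, deploy your container groups with GPU resources. The container instances in the group can access one or more NVIDIA Tesla GPUs while running container workloads such as CUDA and deep learning applications.

This article shows how to add GPU resources when you deploy a container group by using a YAML file or a Resource Manager template. You can also specify GPU resources when you deploy a container instance using the Azure portal.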

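Important

The K80 and P100 GPU SKUs were scheduled for retirement by August 31, 2023, because the underlying VMs they use (NC series and NCv2 series) are being retired. Although V100 SKUs remain available, we recommend using Azure Kubernetes Service (AKS) instead. GPU resources aren't fully supported and shouldn't be used for production workloads. To migrate to AKS today, see How to migrate to AKS.

Important

This feature is currently in preview, and some limitations apply. Previews are made available to you on the condition that you agree to the supplemental terms of use. Some aspects of this feature may change before general availability (GA).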

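Prerequisites

Note

Due to some current limitations, not all limit increase requests are guaranteed to be approved.

If you want to use this SKU for production container deployments, create an Azure support request to increase the limit.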

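Preview limitations

In preview, the following limitations apply when using GPU resources in container groups.

Region availability

Regions	OS	Available GPU SKUs
East US, West Europe, West US 2, Southeast Asia, Central India	Linux	V100

Support will be added for additional regions over time.

Supported OS types: Linux only

Additional limitations: GPU resources can't be used when deploying a container group into a virtual network.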

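About GPU resources

Count and SKU

To use GPUs in a container instance, specify a GPU resource with the following information:

Count - The number of GPUs: 1, 2, or 4.
SKU - The GPU SKU: V100. Each SKU maps to the NVIDIA Tesla GPU in one of the following Azure GPU-enabled VM families: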

SKU	VM family
V100	NCv3

Maximum resources per SKU

OS	GPU SKU	GPU count	Max CPU	Max memory (GB)	Storage (GB)
Linux	V100	1	6	112	50
Linux	V100	2	12	224	50
Linux	V100	4	24	448	50

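When deploying GPU resources, set CPU and memory resources appropriate for the workload, up to the maximum values shown in the preceding table. These values are currently larger than the CPU and memory resources available in container groups without GPU resources.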
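The limits in the preceding table can be checked before a deployment is submitted. The following is a minimal sketch, not part of any Azure SDK: the names `V100_LIMITS` and `check_gpu_request` are hypothetical, and the values are copied from the table.

```python
# Maximum CPU and memory per V100 GPU count, copied from the table above.
# V100_LIMITS and check_gpu_request are hypothetical names for illustration.
V100_LIMITS = {
    1: {"max_cpu": 6, "max_memory_gb": 112},
    2: {"max_cpu": 12, "max_memory_gb": 224},
    4: {"max_cpu": 24, "max_memory_gb": 448},
}


def check_gpu_request(gpu_count, cpu, memory_gb):
    """Return a list of problems with a V100 resource request (empty if none)."""
    limits = V100_LIMITS.get(gpu_count)
    if limits is None:
        return [f"GPU count must be 1, 2, or 4, got {gpu_count}"]
    problems = []
    if cpu > limits["max_cpu"]:
        problems.append(f"cpu {cpu} exceeds maximum {limits['max_cpu']}")
    if memory_gb > limits["max_memory_gb"]:
        problems.append(
            f"memoryInGB {memory_gb} exceeds maximum {limits['max_memory_gb']}"
        )
    return problems


if __name__ == "__main__":
    # A request of 4 CPUs and 12 GB with one GPU is within the limits.
    print(check_gpu_request(1, 4.0, 12.0))
    # Eight CPUs with one GPU exceeds the 6-CPU maximum.
    print(check_gpu_request(1, 8.0, 12.0))
```

A check like this only covers the per-SKU maximums. Subscription quotas and regional availability are enforced separately by Azure.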

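Important

Default subscription limits (quotas) for GPU resources differ by SKU. The default CPU limit for the V100 SKU is initially set to 0. To request an increase in an available region, submit an Azure support request.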

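Things to know

Deployment time - Creating a container group that contains GPU resources takes up to 8-10 minutes, because of the extra time needed to provision and configure a GPU VM in Azure.
Pricing - As with container groups without GPU resources, Azure bills for the resources consumed over the duration of a container group with GPU resources. The duration is calculated from the time your first container's image pull starts until the container group terminates. It doesn't include the time to deploy the container group.

See pricing details.
CUDA drivers - Container instances with GPU resources are preprovisioned with NVIDIA CUDA drivers and container runtimes, so you can use container images developed for CUDA workloads.

We support up through CUDA 11 at this stage. For example, you can use the following base images for your Dockerfile:
- nvidia/cuda:11.4.2-base-ubuntu20.04
- tensorflow/tensorflow:devel-gpu
Note

To improve reliability when using a public container image from Docker Hub, import and manage the image in a private Azure container registry, and update your Dockerfile to use your privately managed base image. Learn more about working with public images.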
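The pricing rule described above (duration runs from the start of the first image pull until the container group terminates) can be illustrated with a short calculation. This is a sketch only: the function name `billable_usage` is hypothetical, the resource-seconds breakdown is an assumption made for illustration, and no rates are included because they come from the pricing details page.

```python
from datetime import datetime


def billable_usage(image_pull_start, terminated, gpu_count, cpu, memory_gb):
    """Return the billable duration and resource-seconds for one container group.

    The duration starts at the first image pull and ends at termination;
    time spent deploying the group before the pull is not counted.
    """
    seconds = (terminated - image_pull_start).total_seconds()
    if seconds < 0:
        raise ValueError("termination time precedes image pull start")
    return {
        "seconds": seconds,
        "gpu_seconds": gpu_count * seconds,
        "vcpu_seconds": cpu * seconds,
        "gb_seconds": memory_gb * seconds,
    }


if __name__ == "__main__":
    # Hypothetical timestamps: the group runs for 30 minutes after the pull starts.
    usage = billable_usage(
        image_pull_start=datetime(2024, 1, 1, 12, 0, 0),
        terminated=datetime(2024, 1, 1, 12, 30, 0),
        gpu_count=1,
        cpu=4.0,
        memory_gb=12.0,
    )
    print(usage)
```

Multiplying each resource-seconds figure by the corresponding rate from the pricing details gives an estimate of the charge for the run.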

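YAML example

One way to add GPU resources is to deploy a container group by using a YAML file. Copy the following YAML into a new file named gpu-deploy-aci.yaml, then save the file. This YAML creates a container group named gpucontainergroup that specifies a container instance with a V100 GPU. The instance runs a sample CUDA vector addition application. The resource requests are sufficient to run the workload.

Note

The following example uses a public container image. To improve reliability, import and manage the image in a private Azure container registry, and update your YAML to use your privately managed base image. Learn more about working with public images.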

additional_properties: {}
apiVersion: '2021-09-01'
name: gpucontainergroup
properties:
  containers:
  - name: gpucontainer
    properties:
      image: k8s-gcrio.azureedge.net/cuda-vector-add:v0.1
      resources:
        requests:
          cpu: 1.0
          memoryInGB: 1.5
          gpu:
            count: 1
            sku: V100
  osType: Linux
  restartPolicy: OnFailure

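Deploy the container group with the az container create command, specifying the YAML file name for the --file parameter. You need to supply the name of a resource group and a location for the container group that supports GPU resources, such as eastus.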

az container create --resource-group myResourceGroup --file gpu-deploy-aci.yaml --location eastus

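The deployment takes several minutes to complete. Then the container starts and runs a CUDA vector addition operation. Run the az container logs command to view the log output: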

az container logs --resource-group myResourceGroup --name gpucontainergroup --container-name gpucontainer

Output:

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

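Resource Manager template example

Another way to deploy a container group with GPU resources is by using a Resource Manager template. Start by creating a file named gpudeploy.json, then copy the following JSON into it. This example deploys a container instance with a V100 GPU that runs a TensorFlow training job against the MNIST dataset. The resource requests are sufficient to run the workload.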

{
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
      "containerGroupName": {
        "type": "string",
        "defaultValue": "gpucontainergrouprm",
        "metadata": {
          "description": "Container Group name."
        }
      }
    },
    "variables": {
      "containername": "gpucontainer",
      "containerimage": "mcr.microsoft.com/azuredocs/samples-tf-mnist-demo:gpu"
    },
    "resources": [
      {
        "name": "[parameters('containerGroupName')]",
        "type": "Microsoft.ContainerInstance/containerGroups",
        "apiVersion": "2021-09-01",
        "location": "[resourceGroup().location]",
        "properties": {
          "containers": [
            {
              "name": "[variables('containername')]",
              "properties": {
                "image": "[variables('containerimage')]",
                "resources": {
                  "requests": {
                    "cpu": 4.0,
                    "memoryInGb": 12.0,
                    "gpu": {
                      "count": 1,
                      "sku": "V100"
                    }
                  }
                }
              }
            }
          ],
          "osType": "Linux",
          "restartPolicy": "OnFailure"
        }
      }
    ]
}
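If you need the same template with a different GPU count or resource request, it can be generated rather than edited by hand. The following sketch rebuilds the template above as a Python dictionary and prints it as JSON. The function name `build_gpu_template` and its parameters are hypothetical; the template content itself mirrors the JSON in this section.

```python
import json


def build_gpu_template(gpu_count=1, cpu=4.0, memory_gb=12.0):
    """Build the Resource Manager template from this section as a Python dict."""
    if gpu_count not in (1, 2, 4):
        raise ValueError("GPU count must be 1, 2, or 4")
    container = {
        "name": "[variables('containername')]",
        "properties": {
            "image": "[variables('containerimage')]",
            "resources": {
                "requests": {
                    "cpu": cpu,
                    "memoryInGb": memory_gb,
                    "gpu": {"count": gpu_count, "sku": "V100"},
                }
            },
        },
    }
    return {
        "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
        "contentVersion": "1.0.0.0",
        "parameters": {
            "containerGroupName": {
                "type": "string",
                "defaultValue": "gpucontainergrouprm",
                "metadata": {"description": "Container Group name."},
            }
        },
        "variables": {
            "containername": "gpucontainer",
            "containerimage": "mcr.microsoft.com/azuredocs/samples-tf-mnist-demo:gpu",
        },
        "resources": [
            {
                "name": "[parameters('containerGroupName')]",
                "type": "Microsoft.ContainerInstance/containerGroups",
                "apiVersion": "2021-09-01",
                "location": "[resourceGroup().location]",
                "properties": {
                    "containers": [container],
                    "osType": "Linux",
                    "restartPolicy": "OnFailure",
                },
            }
        ],
    }


if __name__ == "__main__":
    # Print the default template; redirect the output to gpudeploy.json to use it.
    print(json.dumps(build_gpu_template(), indent=2))
```

Called with no arguments, the function reproduces the single-GPU template shown above.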

Deploy the template with the az deployment group create command. You need to supply the name of a resource group that was created in a region that supports GPU resources, such as eastus.

az deployment group create --resource-group myResourceGroup --template-file gpudeploy.json

The deployment takes several minutes to complete. Then the container starts and runs the TensorFlow job. Run the az container logs command to view the log output:

az container logs --resource-group myResourceGroup --name gpucontainergrouprm --container-name gpucontainer

Output:

2018-10-25 18:31:10.155010: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-10-25 18:31:10.305937: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla V100 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: ccb6:00:00.0
totalMemory: 11.92GiB freeMemory: 11.85GiB
2018-10-25 18:31:10.305981: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla V100, pci bus id: ccb6:00:00.0, compute capability: 3.7)
2018-10-25 18:31:14.941723: I tensorflow/stream_executor/dso_loader.cc:139] successfully opened CUDA library libcupti.so.8.0 locally
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /tmp/tensorflow/input_data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/tensorflow/input_data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/tensorflow/input_data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/tensorflow/input_data/t10k-labels-idx1-ubyte.gz
Accuracy at step 0: 0.097
Accuracy at step 10: 0.6993
Accuracy at step 20: 0.8208
Accuracy at step 30: 0.8594
...
Accuracy at step 990: 0.969
Adding run metadata for 999

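Clean up resources

Because using GPU resources can be expensive, make sure that your containers don't run unexpectedly for long periods. Monitor your containers in the Azure portal, or check the status of a container group with the az container show command. For example: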

az container show --resource-group myResourceGroup --name gpucontainergroup --output table
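To automate this check, you can request JSON output (--output json) instead of a table and inspect the state fields. The following sketch parses such output. The function names are hypothetical, the sample document is a trimmed, made-up example, and the field names (`provisioningState`, `instanceView.state`) are an assumption about the shape of the CLI output, so confirm them against your own output before relying on them.

```python
import json


def summarize_container_group(show_output_json):
    """Extract name and state fields from `az container show --output json` text."""
    group = json.loads(show_output_json)
    instance_view = group.get("instanceView") or {}
    return {
        "name": group.get("name"),
        "provisioning_state": group.get("provisioningState"),
        "state": instance_view.get("state"),
    }


def is_still_running(summary):
    """Return True when the container group reports a Running state."""
    return summary["state"] == "Running"


if __name__ == "__main__":
    # Hypothetical, trimmed sample; real output contains many more fields.
    sample = json.dumps(
        {
            "name": "gpucontainergroup",
            "provisioningState": "Succeeded",
            "instanceView": {"state": "Running"},
        }
    )
    summary = summarize_container_group(sample)
    print(summary)
    if is_still_running(summary):
        print("Container group is still running and consuming GPU resources.")
```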

When you're done working with the container instances you created, delete them with the following commands:

az container delete --resource-group myResourceGroup --name gpucontainergroup -y
az container delete --resource-group myResourceGroup --name gpucontainergrouprm -y

Next steps

Learn more about deploying a container group using a YAML file or a Resource Manager template.
Learn more about GPU-optimized VM sizes in Azure.
