Deploy container instances that use GPU resources

Important

This product is retired as of July 14, 2025.

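To run certain compute-intensive workloads on Azure Container Instances, deploy your container groups with graphics processing unit (GPU) resources. The container instances in the group can access one or more NVIDIA Tesla GPUs while running container workloads such as Compute Unified Device Architecture (CUDA) and deep learning applications.

This article shows how to add GPU resources when you deploy a container group by using a YAML file or an Azure Resource Manager template (ARM template). You can also specify GPU resources when you deploy a container instance by using the Azure portal.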

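Prerequisites

Because of some current limitations, not all limit increase requests are guaranteed to be approved.

If you want to use this SKU for production container deployments, create an Azure support request to increase the limit.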

Preview limitations

In preview, the following limitations apply when you use GPU resources in container groups.

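Region availability

Regions	OS	Available GPU SKUs
East US, West Europe, West US 2, Southeast Asia, Central India	Linux	V100

Support will be added for more regions over time.

Supported OS types: Linux only.

Other limitations: You can't use GPU resources when you deploy a container group into a virtual network.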

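About GPU resources

Count and SKU

To use GPUs in a container instance, specify a GPU resource with the following information:

Count: The number of GPUs is 1, 2, or 4.
SKU: The GPU SKU is V100. Each SKU maps to the NVIDIA Tesla GPU in one of the following Azure GPU-enabled VM families: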

SKU	VM family
V100	NCv3

Maximum resources per SKU

OS	GPU SKU	GPU count	Max CPU	Max memory (GB)	Storage (GB)
Linux	V100	1	6	112	50
Linux	V100	2	12	224	50
Linux	V100	4	24	448	50

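When you deploy GPU resources, set CPU and memory resources appropriate for the workload, up to the maximum values shown in the preceding table. These values are currently larger than the CPU and memory resources available in container groups without GPU resources.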
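As a rough illustration of the limits in the preceding table, the following Python sketch checks a requested GPU count, CPU, and memory against the V100 maximums before you write them into a deployment file. The function name `check_gpu_request` and the `V100_LIMITS` dictionary are hypothetical helpers introduced here; they aren't part of any Azure SDK or CLI, and the service itself remains the authority on what it accepts.

```python
# Per-SKU maximums copied from the table above (Linux, V100).
# Keys are GPU counts; values are (max CPU cores, max memory in GB).
V100_LIMITS = {1: (6, 112), 2: (12, 224), 4: (24, 448)}


def check_gpu_request(gpu_count, cpu, memory_gb, sku="V100"):
    """Return a list of problems with a resource request (empty if it fits)."""
    if sku != "V100":
        return [f"unsupported SKU: {sku}"]
    if gpu_count not in V100_LIMITS:
        return [f"GPU count must be 1, 2, or 4, got {gpu_count}"]
    max_cpu, max_mem = V100_LIMITS[gpu_count]
    problems = []
    if cpu > max_cpu:
        problems.append(f"cpu {cpu} exceeds maximum {max_cpu}")
    if memory_gb > max_mem:
        problems.append(f"memory {memory_gb} GB exceeds maximum {max_mem} GB")
    return problems


print(check_gpu_request(1, 4.0, 12.0))  # a request that fits one V100
print(check_gpu_request(1, 8.0, 12.0))  # too many CPU cores for one V100
```

A local check like this only catches requests that exceed the documented maximums. It doesn't account for your subscription quota, which might be lower.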

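Important

Default subscription limits (quotas) for GPU resources differ by SKU. The default CPU limits for V100 SKUs are initially set to 0. To request an increase in an available region, submit an Azure support request.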

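Things to know

Deployment time: Creating a container group that contains GPU resources takes up to 8 to 10 minutes. More time is needed to provision and configure a GPU virtual machine (VM) in Azure.
Pricing: As with container groups without GPU resources, Azure bills for the resources consumed over the duration of a container group with GPU resources. The duration is calculated from the time your first container's image starts to pull until the container group terminates. It doesn't include the time to deploy the container group.

For more information, see pricing details.
CUDA drivers: Container instances with GPU resources are preprovisioned with NVIDIA CUDA drivers and container runtimes, so you can use container images developed for CUDA workloads.

We support up through CUDA 11 at this stage. For example, you can use the following base images for your Dockerfile: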
- nvidia/cuda:11.4.2-base-ubuntu20.04
- tensorflow/tensorflow:devel-gpu
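To improve reliability when you use a public container image from Docker Hub, import and manage the image in a private Azure container registry. Then update your Dockerfile to use your privately managed base image. Learn more about working with public images.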
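The pricing note above defines the billed duration as running from the first image pull until the container group terminates, excluding deployment time. The following Python sketch works through that arithmetic. The timestamps and the `billable_duration` helper are hypothetical and for illustration only; actual charges depend on the published rates and on how Azure records these events.

```python
from datetime import datetime, timedelta


def billable_duration(first_pull_start, terminated):
    """Duration from the start of the first image pull to termination."""
    if terminated < first_pull_start:
        raise ValueError("termination precedes the first image pull")
    return terminated - first_pull_start


# Hypothetical timestamps, for illustration only.
deploy_requested = datetime(2025, 1, 1, 12, 0, 0)  # deployment time isn't billed
first_pull_start = datetime(2025, 1, 1, 12, 9, 0)
terminated = datetime(2025, 1, 1, 12, 39, 0)

print(billable_duration(first_pull_start, terminated))
```

In this example, the nine minutes between the deployment request and the first image pull fall outside the billed duration.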

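YAML example

One way to add GPU resources is to deploy a container group by using a YAML file. Copy the following YAML into a new file named gpu-deploy-aci.yaml, and then save the file. This YAML creates a container group named gpucontainergroup that specifies a container instance with a V100 GPU. The instance runs a sample CUDA vector addition application. The resource requests are sufficient to run the workload.

Note

The following example uses a public container image. To improve reliability, import and manage the image in a private Azure container registry. Then update your YAML to use your privately managed base image. Learn more about working with public images.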

additional_properties: {}
apiVersion: '2021-09-01'
name: gpucontainergroup
properties:
  containers:
  - name: gpucontainer
    properties:
      image: k8s-gcrio.azureedge.net/cuda-vector-add:v0.1
      resources:
        requests:
          cpu: 1.0
          memoryInGB: 1.5
          gpu:
            count: 1
            sku: V100
  osType: Linux
  restartPolicy: OnFailure

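Deploy the container group with the az container create command, and specify the YAML file name for the --file parameter. You need to supply the name of a resource group and a location for the container group, such as eastus, that supports GPU resources.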

az container create --resource-group myResourceGroup --file gpu-deploy-aci.yaml --location eastus

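The deployment takes several minutes to complete. Then the container starts and runs a CUDA vector addition operation. Run the az container logs command to view the log output: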

az container logs --resource-group myResourceGroup --name gpucontainergroup --container-name gpucontainer

Output:

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
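If you want to check the result in a script rather than by eye, one option is to scan the log text for the sample's success line. The following Python sketch assumes that the sample prints a line reading exactly `Test PASSED` on success, as in the output above. The `vector_add_passed` function is a hypothetical helper, and the sample log is copied from this article rather than fetched from Azure.

```python
def vector_add_passed(log_text):
    """Return True if the sample's log output contains a 'Test PASSED' line."""
    return any(line.strip() == "Test PASSED" for line in log_text.splitlines())


# Log text copied from the example output in this article.
sample_log = """[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
"""

print(vector_add_passed(sample_log))
```

To use it against a real deployment, you could capture the output of the az container logs command shown earlier and pass that text to the function.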

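Resource Manager template example

Another way to deploy a container group with GPU resources is to use an ARM template. Start by creating a file named gpudeploy.json, and then copy the following JSON into it. This example deploys a container instance with a V100 GPU that runs a TensorFlow training job against the MNIST dataset. The resource requests are sufficient to run the workload.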

{
    "$schema": "https://schema.management.azure.com/schemas/2015-01-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
      "containerGroupName": {
        "type": "string",
        "defaultValue": "gpucontainergrouprm",
        "metadata": {
          "description": "Container Group name."
        }
      }
    },
    "variables": {
      "containername": "gpucontainer",
      "containerimage": "mcr.microsoft.com/azuredocs/samples-tf-mnist-demo:gpu"
    },
    "resources": [
      {
        "name": "[parameters('containerGroupName')]",
        "type": "Microsoft.ContainerInstance/containerGroups",
        "apiVersion": "2021-09-01",
        "location": "[resourceGroup().location]",
        "properties": {
            "containers": [
              {
                "name": "[variables('containername')]",
                "properties": {
                  "image": "[variables('containerimage')]",
                  "resources": {
                    "requests": {
                      "cpu": 4.0,
                      "memoryInGb": 12.0,
                      "gpu": {
                        "count": 1,
                        "sku": "V100"
                      }
                    }
                  }
                }
              }
            ],
            "osType": "Linux",
            "restartPolicy": "OnFailure"
        }
      }
    ]
}

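Deploy the template with the az deployment group create command. You need to supply the name of a resource group that was created in a region, such as eastus, that supports GPU resources.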

az deployment group create --resource-group myResourceGroup --template-file gpudeploy.json

The deployment takes several minutes to complete. Then the container starts and runs the TensorFlow job. Run the az container logs command to view the log output:

az container logs --resource-group myResourceGroup --name gpucontainergrouprm --container-name gpucontainer

Output:

2018-10-25 18:31:10.155010: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-10-25 18:31:10.305937: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla V100 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: ccb6:00:00.0
totalMemory: 11.92GiB freeMemory: 11.85GiB
2018-10-25 18:31:10.305981: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla V100, pci bus id: ccb6:00:00.0, compute capability: 3.7)
2018-10-25 18:31:14.941723: I tensorflow/stream_executor/dso_loader.cc:139] successfully opened CUDA library libcupti.so.8.0 locally
Successfully downloaded train-images-idx3-ubyte.gz 9912422 bytes.
Extracting /tmp/tensorflow/input_data/train-images-idx3-ubyte.gz
Successfully downloaded train-labels-idx1-ubyte.gz 28881 bytes.
Extracting /tmp/tensorflow/input_data/train-labels-idx1-ubyte.gz
Successfully downloaded t10k-images-idx3-ubyte.gz 1648877 bytes.
Extracting /tmp/tensorflow/input_data/t10k-images-idx3-ubyte.gz
Successfully downloaded t10k-labels-idx1-ubyte.gz 4542 bytes.
Extracting /tmp/tensorflow/input_data/t10k-labels-idx1-ubyte.gz
Accuracy at step 0: 0.097
Accuracy at step 10: 0.6993
Accuracy at step 20: 0.8208
Accuracy at step 30: 0.8594
...
Accuracy at step 990: 0.969
Adding run metadata for 999
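To follow training progress without reading the whole log, you could extract the accuracy lines. The following Python sketch assumes the log keeps the `Accuracy at step N: value` format shown in the output above. The `parse_accuracy` function is a hypothetical helper, and the sample lines are copied from this article.

```python
import re

# Matches lines such as "Accuracy at step 10: 0.6993".
ACCURACY_LINE = re.compile(r"^Accuracy at step (\d+): ([0-9.]+)$")


def parse_accuracy(log_text):
    """Return a list of (step, accuracy) pairs found in the log text."""
    results = []
    for line in log_text.splitlines():
        match = ACCURACY_LINE.match(line.strip())
        if match:
            results.append((int(match.group(1)), float(match.group(2))))
    return results


# Accuracy lines copied from the example output in this article.
sample_log = """Accuracy at step 0: 0.097
Accuracy at step 10: 0.6993
Accuracy at step 20: 0.8208
Accuracy at step 30: 0.8594
...
Accuracy at step 990: 0.969
Adding run metadata for 999
"""

points = parse_accuracy(sample_log)
print(points[-1])  # the last reported (step, accuracy) pair
```

Lines that don't match the pattern, such as the TensorFlow device messages, are ignored.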

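Clean up resources

Because using GPU resources might be expensive, ensure that your containers don't run unexpectedly for long periods. Monitor your containers in the Azure portal. You can also check the status of a container group with the az container show command. For example: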

az container show --resource-group myResourceGroup --name gpucontainergroup --output table

When you're done working with the container instances that you created, delete them with the following commands:

az container delete --resource-group myResourceGroup --name gpucontainergroup -y
az container delete --resource-group myResourceGroup --name gpucontainergrouprm -y
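If you clean up several container groups from a script, one approach is to build the delete commands from a list of names. The following Python sketch only constructs and prints the same az container delete commands shown above, using the flags from this article. The `delete_commands` function is a hypothetical helper, and nothing is deleted unless you choose to run the printed commands.

```python
import shlex


def delete_commands(resource_group, group_names):
    """Build one `az container delete` argument list per container group."""
    return [
        ["az", "container", "delete",
         "--resource-group", resource_group,
         "--name", name, "-y"]
        for name in group_names
    ]


for argv in delete_commands("myResourceGroup",
                            ["gpucontainergroup", "gpucontainergrouprm"]):
    # Print the command instead of running it. To execute it, you could
    # pass argv to subprocess.run(argv, check=True).
    print(shlex.join(argv))
```

Keeping each command as an argument list avoids shell quoting problems if a resource group or container group name contains unusual characters.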

Learn more about deploying a container group by using a YAML file or an ARM template.
Learn more about GPU-optimized VM sizes in Azure.

Last updated on 2025-11-21