Use Windows GPUs for compute-intensive workloads on Azure Kubernetes Service (AKS) (preview)

Article
10/05/2024

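Graphical processing units (GPUs) are often used for compute-intensive workloads, such as graphics and visualization workloads. AKS supports GPU-enabled Windows and Linux node pools to run compute-intensive Kubernetes workloads.

This article helps you provision Windows nodes with schedulable GPUs on new and existing AKS clusters (preview).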

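Supported GPU-enabled virtual machines (VMs)

To view supported GPU-enabled VMs, see GPU-optimized VM sizes in Azure. For AKS node pools, we recommend a minimum size of Standard_NC6s_v3. AKS doesn't support the NVv4 series (based on AMD GPUs).

Note

GPU-enabled VMs contain specialized hardware, so they are subject to higher pricing and limited region availability. For more information, see the pricing tool and region availability.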

Limitations

Updating an existing Windows node pool to add GPU isn't supported.
Kubernetes version 1.28 and earlier aren't supported.
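
The version limitation can be checked before you create a node pool. The following is a minimal sketch rather than part of any AKS tooling: `supports_windows_gpu` is a hypothetical helper name, the input is assumed to be a plain `major.minor.patch` string with no `v` prefix, and 1.29 is assumed to be the first supported minor version because the examples in this article pass `--kubernetes-version 1.29.0`.

```shell
# Hypothetical pre-flight helper (not part of the Azure CLI).
# Returns 0 when the Kubernetes version is newer than 1.28, and 1 otherwise.
# Assumes a "major.minor.patch" string such as 1.29.0.
supports_windows_gpu() {
  swg_version="$1"
  swg_major="${swg_version%%.*}"   # text before the first dot
  swg_rest="${swg_version#*.}"     # text after the first dot
  swg_minor="${swg_rest%%.*}"      # text before the next dot
  if [ "$swg_major" -gt 1 ]; then
    return 0
  fi
  [ "$swg_major" -eq 1 ] && [ "$swg_minor" -ge 29 ]
}
```

For example, `supports_windows_gpu 1.29.0 && echo "version is supported"` prints the message, while the same call with `1.28.5` prints nothing.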

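Before you begin

This article assumes you have an existing AKS cluster. If you don't have a cluster, create one using the Azure CLI, Azure PowerShell, or the Azure portal.
You need Azure CLI version 1.0.0b2 or later installed and configured to use the --skip-gpu-driver-install field with the az aks nodepool add command. Run az --version to find the version. If you need to install or upgrade, see Install Azure CLI.
You need Azure CLI version 9.0.0b5 or later installed and configured to use the --driver-type field with the az aks nodepool add command. Run az --version to find the version. If you need to install or upgrade, see Install Azure CLI.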

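Get the credentials for your cluster

Get the credentials for your AKS cluster using the az aks get-credentials command. The following example command gets the credentials for myAKSCluster in the myResourceGroup resource group: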
```
az aks get-credentials --resource-group myResourceGroup --name myAKSCluster
```

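Using Windows GPU with automatic driver installation

Using NVIDIA GPUs involves installing various NVIDIA software components, such as the DirectX device plugin for Kubernetes and the GPU driver. When you create a Windows node pool with a supported GPU-enabled VM, these components and the appropriate NVIDIA CUDA or GRID drivers are installed. For NC and ND series VM sizes, the CUDA driver is installed. For NV series VM sizes, the GRID driver is installed.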
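
The mapping from VM series to default driver can be written down as a small lookup. This is an illustrative sketch only: `default_gpu_driver_type` is a hypothetical helper, not an AKS or Azure CLI command, and the `Standard_NV*_v4` pattern used to flag the NVv4 series (which this article lists as unsupported) is an assumption about how those size names are formed.

```shell
# Hypothetical helper: print the driver type this article describes as the
# default for a given VM size name. It does not query Azure.
default_gpu_driver_type() {
  case "$1" in
    # NVv4 (AMD GPU) is not supported by AKS; the name pattern is an assumption.
    Standard_NV*_v4) echo "unsupported" ;;
    # NC and ND series get the CUDA driver.
    Standard_NC*|Standard_ND*) echo "CUDA" ;;
    # Remaining NV series sizes get the GRID driver.
    Standard_NV*) echo "GRID" ;;
    *) echo "unknown" ;;
  esac
}
```

Calling `default_gpu_driver_type Standard_NC6s_v3` prints `CUDA`. The NVv4 pattern is listed first so that it takes precedence over the more general NV pattern.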

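Important

AKS preview features are available on a self-service, opt-in basis. Previews are provided "as is" and "as available," and they're excluded from the service-level agreements and limited warranty. AKS previews are partially covered by customer support on a best-effort basis. As such, these features aren't meant for production use. For more information, see the AKS support articles.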

Install the `aks-preview` Azure CLI extension

Register or update the aks-preview extension using the az extension add or az extension update command.
```
# Register the aks-preview extension
az extension add --name aks-preview

# Update the aks-preview extension
az extension update --name aks-preview
```

Register the `WindowsGPUPreview` feature flag

Register the WindowsGPUPreview feature flag using the az feature register command.
```
az feature register --namespace "Microsoft.ContainerService" --name "WindowsGPUPreview"
```
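It takes a few minutes for the status to show Registered.

Verify the registration status using the az feature show command.
```
az feature show --namespace "Microsoft.ContainerService" --name "WindowsGPUPreview"
```

When the status shows Registered, refresh the registration of the Microsoft.ContainerService resource provider using the az provider register command.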
```
az provider register --namespace Microsoft.ContainerService
```
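
Because registration takes a few minutes, it can be convenient to poll the status instead of rerunning the command by hand. The sketch below is one possible approach, not an official procedure: `wait_for_feature_registered` is a hypothetical helper, the attempt count and delay defaults are arbitrary, and it assumes the state is exposed at `properties.state` in the `az feature show` output.

```shell
# Hypothetical helper: poll the feature flag until it reports "Registered".
# Arguments: maximum attempts (default 30) and delay in seconds (default 20).
# Prints "Registered" and returns 0 on success; returns 1 if attempts run out.
wait_for_feature_registered() {
  wfr_attempts="${1:-30}"
  wfr_delay="${2:-20}"
  wfr_i=1
  while [ "$wfr_i" -le "$wfr_attempts" ]; do
    # Assumes the registration state lives at properties.state.
    wfr_state="$(az feature show \
      --namespace "Microsoft.ContainerService" \
      --name "WindowsGPUPreview" \
      --query properties.state \
      --output tsv)"
    if [ "$wfr_state" = "Registered" ]; then
      echo "Registered"
      return 0
    fi
    echo "Attempt $wfr_i: state is '$wfr_state', waiting..." >&2
    sleep "$wfr_delay"
    wfr_i=$((wfr_i + 1))
  done
  return 1
}
```

A typical use would be `wait_for_feature_registered && az provider register --namespace Microsoft.ContainerService`, so the provider refresh runs only after the flag is registered.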

Create a Windows GPU-enabled node pool (preview)

To create a Windows GPU-enabled node pool, you need to use a supported GPU-enabled VM size and specify the os-type as Windows. The default Windows os-sku is Windows2022, but all Windows os-sku options are supported.

Create a Windows GPU-enabled node pool using the az aks nodepool add command.
```
az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name gpunp \
    --node-count 1 \
    --os-type Windows \
    --kubernetes-version 1.29.0 \
    --node-vm-size Standard_NC6s_v3
```

Check that your GPUs are schedulable.
Once you confirm that your GPUs are schedulable, you can run your GPU workload.

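Specify the GPU driver type (preview)

By default, AKS specifies a default GPU driver type for each supported GPU-enabled VM. Because workload and driver compatibility are important for functioning GPU workloads, you can specify the driver type for your Windows GPU node. This feature isn't supported for Linux GPU node pools.

When you create a Windows agent pool with GPU support, you can optionally specify the GPU driver type using the --driver-type flag.

The available options are:

GRID: For applications that require virtualization support.
CUDA: Optimized for computational tasks in scientific computing and data-intensive applications.

Note

When you set the --driver-type flag, you're responsible for ensuring that the selected driver type is compatible with the specific VM size and configuration of your node pool. While AKS attempts to validate compatibility, node pool creation might still fail in some scenarios because the specified driver type is incompatible with the underlying VM or hardware.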

To create a Windows GPU-enabled node pool with a specific GPU driver type, use the az aks nodepool add command.
```
az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name gpunp \
    --node-count 1 \
    --os-type Windows \
    --kubernetes-version 1.29.0 \
    --node-vm-size Standard_NC6s_v3 \
    --driver-type GRID
```
For example, the above command creates a GPU-enabled node pool using the GRID GPU driver type. Selecting this driver type overrides the CUDA driver type that is the default for NC series VM SKUs.

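Using Windows GPU with manual driver installation

When you create a Windows node pool with N-series (NVIDIA GPU) VM sizes in AKS, the GPU driver and the Kubernetes DirectX device plugin are installed automatically. To bypass this automatic installation, use the following steps:

Skip GPU driver installation (preview) using --skip-gpu-driver-install.
Manually install the Kubernetes DirectX device plugin.

Skip GPU driver installation (preview)

AKS has automatic GPU driver installation enabled by default. In some cases, such as installing your own drivers, you might want to skip GPU driver installation.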

Register or update the aks-preview extension using the az extension add or az extension update command.
```
# Register the aks-preview extension
az extension add --name aks-preview

# Update the aks-preview extension
az extension update --name aks-preview
```

Create a node pool using the az aks nodepool add command with the --skip-gpu-driver-install flag to skip automatic GPU driver installation.
```
az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name gpunp \
    --node-count 1 \
    --os-type windows \
    --os-sku windows2022 \
    --node-vm-size Standard_NC6s_v3 \
    --skip-gpu-driver-install
```

Note

If the --node-vm-size you're using isn't yet onboarded on AKS, you can't use GPUs and --skip-gpu-driver-install doesn't work.

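Manually install the Kubernetes DirectX device plugin

You can deploy a DaemonSet for the Kubernetes DirectX device plugin, which runs a pod on each node to provide the required drivers for the GPUs.

Add a node pool to your cluster using the az aks nodepool add command.
```
az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name gpunp \
    --node-count 1 \
    --os-type windows \
    --os-sku windows2022 \
    --node-vm-size Standard_NC6s_v3
```

Create a namespace and deploy the Kubernetes DirectX device plugin

Create a namespace using the kubectl create namespace command.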
```
kubectl create namespace gpu-resources
```

Create a file named k8s-directx-device-plugin.yaml and paste the following YAML manifest, which is provided as part of the NVIDIA device plugin for Kubernetes project:
```
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: gpu-resources
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # Mark this pod as a critical add-on; when enabled, the critical add-on scheduler
      # reserves resources for critical add-on pods so that they can be rescheduled after
      # a failure.  This annotation works in tandem with the toleration below.
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      # Allow this pod to be rescheduled while the node is in "critical add-ons only" mode.
      # This, along with the annotation above marks this pod as a critical add-on.
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      - key: "sku"
        operator: "Equal"
        value: "gpu"
        effect: "NoSchedule"
      containers:
      - image: mcr.microsoft.com/oss/nvidia/k8s-device-plugin:v0.14.1
        name: nvidia-device-plugin-ctr
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
```

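Create the DaemonSet and confirm that the NVIDIA device plugin is created successfully using the kubectl apply command.
```
kubectl apply -f k8s-directx-device-plugin.yaml
```
Now that you've successfully installed the NVIDIA device plugin, you can check that your GPUs are schedulable.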
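
A successful `kubectl apply` only confirms that the DaemonSet object was accepted, so it can help to wait until its pods are actually ready. The following sketch wraps `kubectl rollout status` for the DaemonSet name and namespace used in the manifest in this article. `wait_for_device_plugin` is a hypothetical helper name and the 120-second default timeout is an arbitrary choice.

```shell
# Hypothetical helper: block until the device plugin DaemonSet has rolled out,
# or fail once the timeout (default 120s) expires.
wait_for_device_plugin() {
  kubectl rollout status daemonset/nvidia-device-plugin-daemonset \
    --namespace gpu-resources \
    --timeout="${1:-120s}"
}
```

For example, `wait_for_device_plugin 300s` waits up to five minutes and returns a nonzero exit code if the rollout doesn't complete in that time.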

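Confirm that GPUs are schedulable

After creating your cluster, confirm that GPUs are schedulable in Kubernetes.

List the nodes in your cluster using the kubectl get nodes command.
```
kubectl get nodes
```
Your output should look similar to the following example output:
```
NAME                   STATUS   ROLES   AGE   VERSION
aks-gpunp-28993262-0   Ready    agent   13m   v1.20.7
```

Confirm the GPUs are schedulable using the kubectl describe node command.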
```
kubectl describe node aks-gpunp-28993262-0
```
Under the Capacity section, the GPU should list as microsoft.com/directx: 1. Your output should look similar to the following condensed example output:
```
Capacity:
[...]
 microsoft.com.directx/gpu:                 1
[...]
```
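
To check the capacity line without reading the full output, the GPU count can be extracted with a short filter. This is a sketch under assumptions: `directx_gpu_count` is a hypothetical helper, and it matches any resource name containing `directx` so that it accepts both spellings shown in this section. It reports the first matching line, which is expected to be the one under Capacity.

```shell
# Hypothetical helper: read `kubectl describe node` output on stdin and print
# the count from the first "<resource containing directx>: <number>" line.
# Exits with status 1 when no such line is found.
directx_gpu_count() {
  awk '/directx[^ ]*: +[0-9]+ *$/ { print $NF; found = 1; exit }
       END { if (!found) exit 1 }'
}
```

A typical call pipes the node description into the helper: `kubectl describe node aks-gpunp-28993262-0 | directx_gpu_count`.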

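Use Container insights to monitor GPU usage

Container insights with AKS monitors the following GPU usage metrics:

| Metric name | Metric dimension (tags) | Description |
|---|---|---|
| containerGpuDutyCycle | `container.azm.ms/clusterId`, `container.azm.ms/clusterName`, `containerName`, `gpuId`, `gpuModel`, `gpuVendor` | Percentage of time over the past sample period (60 seconds) during which the GPU was busy or actively processing for a container. Duty cycle is a number between 1 and 100. |
| containerGpuLimits | `container.azm.ms/clusterId`, `container.azm.ms/clusterName`, `containerName` | Each container can specify limits as one or more GPUs. It isn't possible to request or limit a fraction of a GPU. |
| containerGpuRequests | `container.azm.ms/clusterId`, `container.azm.ms/clusterName`, `containerName` | Each container can request one or more GPUs. It isn't possible to request or limit a fraction of a GPU. |
| containerGpumemoryTotalBytes | `container.azm.ms/clusterId`, `container.azm.ms/clusterName`, `containerName`, `gpuId`, `gpuModel`, `gpuVendor` | Amount of GPU memory, in bytes, available to use for a specific container. |
| containerGpumemoryUsedBytes | `container.azm.ms/clusterId`, `container.azm.ms/clusterName`, `containerName`, `gpuId`, `gpuModel`, `gpuVendor` | Amount of GPU memory, in bytes, used by a specific container. |
| nodeGpuAllocatable | `container.azm.ms/clusterId`, `container.azm.ms/clusterName`, `gpuVendor` | Number of GPUs in a node that Kubernetes can use. |
| nodeGpuCapacity | `container.azm.ms/clusterId`, `container.azm.ms/clusterName`, `gpuVendor` | Total number of GPUs in a node. |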

Clean up resources

Remove the associated Kubernetes objects you created in this article using the kubectl delete job command.
```
kubectl delete jobs windows-gpu-workload
```

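Next steps

To run Apache Spark jobs, see Run Apache Spark jobs on AKS.
For more information on features of the Kubernetes scheduler, see Best practices for advanced scheduler features in AKS.
For more information about Azure Kubernetes Service and Azure Machine Learning, see the corresponding Azure documentation.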
