Running NVIDIA GPU workloads on Azure Kubernetes Service (AKS) traditionally requires you to install and maintain the NVIDIA GPU driver, Kubernetes device plugin, and a GPU metrics exporter on each GPU node. These components enable GPU scheduling, container-level GPU access, and telemetry, but installing them manually or through the NVIDIA GPU Operator adds operational overhead.
With fully managed GPU nodes (preview), AKS installs and maintains the NVIDIA GPU driver, device plugin, and Data Center GPU Manager (DCGM) metrics exporter for you. GPU node pool creation becomes a single step, and GPU capacity behaves like any other AKS node pool.
You configure a managed GPU node pool through two fields under `gpuProfile.nvidia`:

- `managementMode` (`Managed` or `Unmanaged`) controls whether AKS installs the full managed GPU stack (driver, device plugin, and DCGM metrics exporter) or the driver only. The default is `Unmanaged`.
- `migStrategy` (`None`, `Single`, or `Mixed`) sets the Multi-Instance GPU (MIG) strategy for supported GPU SKUs such as A100 and H100. The default is `None`.
In this article, you provision a managed GPU node pool, optionally enable MIG, verify the stack, and run a sample GPU workload.
Important
AKS preview features are available on a self-service, opt-in basis. Previews are provided "as is" and "as available," and they're excluded from the service-level agreements and limited warranty. AKS previews are partially covered by customer support on a best-effort basis. As such, these features aren't meant for production use. For more information, see the following support articles:
Before you begin
- This article assumes you have an existing AKS cluster. If you don't have a cluster, create one using the Azure CLI, Azure PowerShell, or the Azure portal.
- You need Azure CLI version 2.85.0 or later installed. To find the version, run `az --version`. If you need to install or upgrade, see Install Azure CLI.
- You need to install and upgrade to the latest version of the `aks-preview` extension.
- Get the credentials for your AKS cluster with `az aks get-credentials` before running the `kubectl` examples in this article.
Managed GPU components
A managed GPU node pool can include the following components on every node:
| Component | What it does | What AKS manages |
|---|---|---|
| NVIDIA GPU driver | Kernel modules and user-space libraries that let the OS and containers talk to the GPU hardware. | Driver version selection, installation at node provisioning, and reinstallation after node image upgrades. |
| NVIDIA Kubernetes device plugin | DaemonSet that advertises GPU resources (`nvidia.com/gpu`, `nvidia.com/mig-*`) to the kubelet so pods can request them. | Deployment, configuration (including MIG strategy), and lifecycle on each GPU node. |
| NVIDIA DCGM and DCGM metrics exporter | Data Center GPU Manager (DCGM) collects GPU health and utilization data and exposes Prometheus metrics (for example, `DCGM_FI_DEV_GPU_UTIL`, `DCGM_FI_DEV_GPU_TEMP`) on port 19400. | Installation, service enablement, and the `kubernetes.azure.com/dcgm-exporter=enabled` node label used to scrape metrics. |
| GPU health signals | Node Problem Detector (NPD) signals that surface GPU-specific node conditions such as `UnhealthyNvidiaDevicePlugin` and `UnhealthyNvidiaDCGMServices`. | NPD monitoring and condition reporting on GPU nodes. |
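The DCGM exporter serves these metrics in standard Prometheus text format. The sketch below parses a fabricated sample payload locally (it doesn't contact a real node) just to illustrate the metric names and scrape layout:

```shell
# Illustrative only: the sample payload below is fabricated, not captured
# from a real GPU node. It mimics the Prometheus text format the DCGM
# exporter serves on port 19400.
sample='# TYPE DCGM_FI_DEV_GPU_UTIL gauge
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-aaaa"} 37
DCGM_FI_DEV_GPU_TEMP{gpu="0",UUID="GPU-aaaa"} 54'

# Strip comment lines and labels, keeping metric name and value.
printf '%s\n' "$sample" | awk '!/^#/ {split($1, m, "{"); print m[1], $2}'
```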
Install profiles
Two `gpuProfile` fields decide which of these components AKS installs:

- `gpuProfile.driver` (`Install` or `None`): whether AKS installs the NVIDIA GPU driver.
- `gpuProfile.nvidia.managementMode` (`Managed` or `Unmanaged`): whether AKS also installs the Kubernetes-facing GPU stack on top of the driver.
Together, they produce three install profiles:
| Install profile | CLI flags | What AKS installs and manages |
|---|---|---|
| Full managed stack | `--enable-managed-gpu=true` | All four components above: driver, device plugin, DCGM metrics exporter, and GPU health monitoring in NPD. |
| Driver only (default) | `--enable-managed-gpu=false` (or neither flag) | NVIDIA GPU driver only. You install and manage the device plugin, metrics exporter, and health monitoring yourself (for example, with the NVIDIA GPU Operator). |
| None (BYO) | `--enable-managed-gpu=false --gpu-driver None` | Nothing. AKS doesn't install any of the four components. You own the full stack. See Bring your own GPU driver. |
Defaults and overrides
- Defaults: If you don't pass `--enable-managed-gpu` or `--gpu-driver`, AKS applies the Driver only profile on a node pool created with an NVIDIA GPU-enabled VM size.
- Override: `managementMode: Managed` requires the driver, so `--gpu-driver None` is ignored when `--enable-managed-gpu=true`, and the driver is still installed. To skip the driver, set both `--enable-managed-gpu=false` and `--gpu-driver None`.
- Immutability: `managementMode`, `migStrategy`, and `driver` are all fixed at creation time. To change the profile, create a new node pool.
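The three outcomes above can be restated as a small decision function. This is a sketch of the selection logic as described in this article, not an official implementation:

```shell
# Sketch of AKS install-profile selection (restated from this article).
# Arguments stand in for the values of --enable-managed-gpu and --gpu-driver.
profile() {
  local managed="$1" driver="$2"
  if [ "$managed" = "true" ]; then
    echo "Full managed stack"   # --gpu-driver None is ignored in this case
  elif [ "$driver" = "None" ]; then
    echo "None (BYO)"
  else
    echo "Driver only"          # default when neither flag is passed
  fi
}

profile true None    # Full managed stack
profile false None   # None (BYO)
profile "" ""        # Driver only
```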
Install the aks-preview CLI extension
1. Install the `aks-preview` CLI extension using the `az extension add` command. Version 19.0.0b29 or later is required.

   ```azurecli
   az extension add --name aks-preview
   ```

2. Update the extension to ensure you have the latest version installed using the `az extension update` command.

   ```azurecli
   az extension update --name aks-preview
   ```
Register the ManagedGPUExperiencePreview feature flag
Register the `ManagedGPUExperiencePreview` feature flag in your subscription using the `az feature register` command.

```azurecli
az feature register --namespace Microsoft.ContainerService --name ManagedGPUExperiencePreview
```
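Feature registration can take a few minutes. As a follow-up (this is the standard `az feature` workflow, not a command from this article), you can check the state and then refresh the resource provider registration once it shows `Registered`:

```azurecli
az feature show --namespace Microsoft.ContainerService --name ManagedGPUExperiencePreview --query properties.state -o tsv

az provider register --namespace Microsoft.ContainerService
```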
Limitations
- This feature currently supports NVIDIA GPU-enabled virtual machine (VM) sizes only.
- Updating a general-purpose node pool to add a GPU VM size isn't supported on AKS.
- Windows node pools aren't supported with this feature, because GPU metrics aren't supported. When you create Windows GPU node pools, AKS automatically installs and manages the drivers and DirectX device plugin. For more information, see the AKS Windows GPU documentation.
- Migrating your existing multi-instance GPU node pools to use this feature isn't supported.
- In-place upgrades from an existing NVIDIA GPU node pool to a managed GPU node pool aren't supported. To migrate, cordon and drain your existing GPU nodes, then redeploy your workloads to a new GPU node pool created with `--enable-managed-gpu=true`. For more information, see Resize node pools on AKS.
- The `managementMode`, `migStrategy`, and `driver` fields under `gpuProfile` are immutable after node pool creation. To change these values, create a new node pool.
- Cluster autoscaler isn't supported on managed GPU node pools during preview. Scale these pools manually.
Note
GPU-enabled VMs contain specialized hardware subject to higher pricing and region availability. For more information, see the pricing tool and region availability.
Create an AKS-managed GPU node pool (preview)
Add a managed GPU node pool to an existing AKS cluster by passing `--enable-managed-gpu=true` to `az aks nodepool add`. AKS sets `gpuProfile.nvidia.managementMode` to `Managed` and installs the GPU driver, device plugin, and DCGM metrics exporter automatically.
To use the default Ubuntu operating system (OS) SKU, you create the node pool without specifying an OS SKU. The node pool is configured for the default operating system based on the Kubernetes version of the cluster.
1. Add a node pool to your cluster using the `az aks nodepool add` command with the `--enable-managed-gpu=true` flag.

   ```azurecli
   az aks nodepool add \
       --resource-group myResourceGroup \
       --cluster-name myAKSCluster \
       --name gpunp \
       --node-count 1 \
       --node-vm-size Standard_NC6s_v3 \
       --node-taints sku=gpu:NoSchedule \
       --enable-managed-gpu=true
   ```

2. Confirm that the managed NVIDIA GPU software components are installed successfully:

   ```azurecli
   az aks nodepool show \
       --resource-group myResourceGroup \
       --cluster-name myAKSCluster \
       --name gpunp
   ```

   Your output should include the following values:

   ```output
   ...
   "gpuProfile": {
       "driver": "Install",
       "driverType": "",
       "nvidia": {
           "managementMode": "Managed",
           "migStrategy": null
       }
   },
   ...
   ```
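If you want to script this check rather than read the full JSON, you can extract just the management mode, for example with a JMESPath query such as `--query "gpuProfile.nvidia.managementMode" -o tsv` on the `az aks nodepool show` command. The sketch below inlines the `gpuProfile` JSON so it's self-contained; in practice you'd pipe the real command output:

```shell
# Illustrative: verify managementMode from nodepool JSON. The JSON is
# inlined here so the example runs standalone; pipe real
# `az aks nodepool show` output in practice.
json='{"gpuProfile":{"driver":"Install","nvidia":{"managementMode":"Managed","migStrategy":null}}}'
mode=$(printf '%s' "$json" | python3 -c 'import json,sys; print(json.load(sys.stdin)["gpuProfile"]["nvidia"]["managementMode"])')
echo "$mode"
```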
Create a managed Multi-Instance GPU (MIG) node pool (preview)
For GPU SKUs that support Multi-Instance GPU (such as A100 and H100), configure a MIG strategy at node pool creation with the `--gpu-mig-strategy` flag. The strategy controls how MIG partitions are exposed to Kubernetes:

- `Single`: All MIG instances are aggregated under the standard `nvidia.com/gpu` resource.
- `Mixed`: Each MIG profile is exposed as a separate resource, such as `nvidia.com/mig-1g.10gb`.
- `None` (default): MIG isn't configured.

The `migStrategy` field is immutable after the node pool is created.
For background on MIG partitioning, supported VM sizes, and GPU instance profiles, see Create a multi-instance GPU node pool in AKS and NVIDIA Multi-Instance GPU.
```azurecli
az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name mignp \
    --node-count 1 \
    --node-vm-size Standard_NC24ads_A100_v4 \
    --node-taints sku=gpu:NoSchedule \
    --enable-managed-gpu=true \
    --gpu-instance-profile MIG1g \
    --gpu-mig-strategy Single
```
With this configuration, pods request GPU resources using the standard `nvidia.com/gpu` resource name.
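To make the strategy difference concrete, the pod-spec fragments below contrast the two request styles (these are illustrative sketches, not complete manifests): under `Single`, pods request the standard resource; under `Mixed`, they request a profile-specific resource instead.

```yaml
# Single strategy: MIG instances surface as the standard GPU resource.
resources:
  limits:
    nvidia.com/gpu: 1
---
# Mixed strategy: request a specific MIG profile resource instead.
resources:
  limits:
    nvidia.com/mig-1g.10gb: 1
```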
Verify the managed GPU node pool (preview)
After the node pool is ready, run the following checks to confirm that the full managed stack is installed and healthy.
1. Verify GPU-specific node conditions from Node Problem Detector (NPD):

   ```bash
   GPU_NODE=$(kubectl get nodes -l agentpool=gpunp -o jsonpath='{.items[0].metadata.name}')
   kubectl describe node $GPU_NODE
   ```

   On a managed GPU node, the following conditions should both report `False`:

   | Condition | Status | Reason |
   |---|---|---|
   | `UnhealthyNvidiaDevicePlugin` | `False` | `HealthyNvidiaDevicePlugin` |
   | `UnhealthyNvidiaDCGMServices` | `False` | `HealthyNvidiaDCGMServices` |

2. Verify that the managed GPU label is present on the node:

   ```bash
   kubectl get node $GPU_NODE -o jsonpath='{.metadata.labels.kubernetes\.azure\.com/dcgm-exporter}'
   ```

   Expected output: `enabled`.

3. Verify that GPU resources are advertised in the node's allocatable resources:

   ```bash
   kubectl get node $GPU_NODE -o jsonpath='{.status.allocatable}'
   ```

   For a non-MIG node pool, the output includes `"nvidia.com/gpu": "1"` (or more, depending on the SKU). For a MIG `Mixed` node pool, the output includes MIG-specific resources such as `"nvidia.com/mig-1g.10gb": "7"`.

4. Run a sample workload to confirm GPU access from within a container:

   ```yaml
   apiVersion: v1
   kind: Pod
   metadata:
     name: managed-gpu-test
   spec:
     restartPolicy: Never
     tolerations:
     - key: "sku"
       operator: "Equal"
       value: "gpu"
       effect: "NoSchedule"
     containers:
     - name: gpu-test
       image: mcr.microsoft.com/azuredocs/samples-tf-mnist-demo:gpu
       command: ["nvidia-smi"]
       resources:
         limits:
           nvidia.com/gpu: 1
   ```

5. View the pod logs to see `nvidia-smi` output showing the GPU device, driver version, and CUDA version:

   ```bash
   kubectl logs managed-gpu-test
   ```
Scale a managed GPU node pool (preview)
Scale a managed GPU node pool manually with the `az aks nodepool scale` command. New nodes install the full managed GPU stack.

```azurecli
az aks nodepool scale \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name gpunp \
    --node-count 2
```
Important
During preview, managed GPU node pools don't support the cluster autoscaler. Scale these pools manually.
Alternative install profiles
If the Full managed stack profile isn't the right fit, AKS supports two alternative profiles on GPU node pools.
Driver only

Use this profile when you want AKS to install and maintain the NVIDIA GPU driver, but you plan to deploy the device plugin and metrics exporter yourself (for example, with the NVIDIA GPU Operator). Set `--enable-managed-gpu=false`:
```azurecli
az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name gpunp \
    --node-count 1 \
    --node-vm-size Standard_NC6s_v3 \
    --node-taints sku=gpu:NoSchedule \
    --enable-managed-gpu=false
```
With this configuration:
- AKS installs and manages the NVIDIA GPU driver (`gpuProfile.driver` is `Install`).
- AKS doesn't install a device plugin, DCGM metrics exporter, or GPU health rules. `gpuProfile.nvidia` is `null`.
- No `nvidia.com/gpu` resource is advertised until you deploy a device plugin.
Next steps
- Deploy a sample GPU workload on your AKS-managed GPU-enabled nodes.
- Learn about GPU utilization and performance metrics from managed NVIDIA DCGM exporter on your GPU node pool.
Related articles
- Use NVIDIA GPUs on AKS for the standard (non-managed) GPU experience.
- Create a multi-instance GPU (MIG) node pool for background on MIG partitioning and supported VM sizes.
- NVIDIA GPU Operator for managing GPU drivers and the device plugin yourself.
- Monitor GPU metrics from the managed NVIDIA DCGM exporter.
- GPU health monitoring with Node Problem Detector (NPD) on AKS.
- Use Windows GPUs on AKS for Windows GPU node pools.
- Azure GPU VM sizes for the full list of NVIDIA GPU-enabled VMs.
- Run distributed inference on multiple AKS GPU nodes.