
Create a fully managed GPU node pool on Azure Kubernetes Service (AKS) (preview)

Running NVIDIA GPU workloads on Azure Kubernetes Service (AKS) traditionally requires you to install and maintain the NVIDIA GPU driver, Kubernetes device plugin, and a GPU metrics exporter on each GPU node. These components enable GPU scheduling, container-level GPU access, and telemetry, but installing them manually or through the NVIDIA GPU Operator adds operational overhead.

With fully managed GPU nodes (preview), AKS installs and maintains the NVIDIA GPU driver, device plugin, and Data Center GPU Manager (DCGM) metrics exporter for you. GPU node pool creation becomes a single step, and GPU capacity behaves like any other AKS node pool.

You configure a managed GPU node pool through two fields under gpuProfile.nvidia:

  • managementMode (Managed or Unmanaged) controls whether AKS installs the full managed GPU stack (driver, device plugin, and DCGM metrics exporter) or the driver only. The default is Unmanaged.
  • migStrategy (None, Single, or Mixed) sets the Multi-Instance GPU (MIG) strategy for supported GPU SKUs such as A100 and H100. The default is None.

In this article, you provision a managed GPU node pool, optionally enable MIG, verify the stack, and run a sample GPU workload.

Important

AKS preview features are available on a self-service, opt-in basis. Previews are provided "as is" and "as available," and they're excluded from the service-level agreements and limited warranty. AKS previews are partially covered by customer support on a best-effort basis. As such, these features aren't meant for production use. For more information, see the AKS support policies.

Before you begin

Managed GPU components

A managed GPU node pool can include the following components on every node:

| Component | What it does | What AKS manages |
| --- | --- | --- |
| NVIDIA GPU driver | Kernel modules and user-space libraries that let the OS and containers talk to the GPU hardware. | Driver version selection, installation at node provisioning, and reinstallation after node image upgrades. |
| NVIDIA Kubernetes device plugin | DaemonSet-equivalent that advertises GPU resources (`nvidia.com/gpu`, `nvidia.com/mig-*`) to the kubelet so pods can request them. | Deployment, configuration (including MIG strategy), and lifecycle on each GPU node. |
| NVIDIA DCGM and DCGM metrics exporter | Data Center GPU Manager collects GPU health and utilization data and exposes Prometheus metrics (for example, `DCGM_FI_DEV_GPU_UTIL`, `DCGM_FI_DEV_GPU_TEMP`) on port 19400. | Installation, service enablement, and the `kubernetes.azure.com/dcgm-exporter=enabled` node label used to scrape metrics. |
| GPU health signals | Node Problem Detector (NPD) signals that surface GPU-specific node conditions such as `UnhealthyNvidiaDevicePlugin` and `UnhealthyNvidiaDCGMServices`. | NPD monitoring and condition reporting on GPU nodes. |

Install profiles

Two gpuProfile fields decide which of those components AKS installs:

  • gpuProfile.driver (Install or None): whether AKS installs the NVIDIA GPU driver.
  • gpuProfile.nvidia.managementMode (Managed or Unmanaged): whether AKS also installs the Kubernetes-facing GPU stack on top of the driver.

Together, they produce three install profiles:

| Install profile | CLI flags | What AKS installs and manages |
| --- | --- | --- |
| Full managed stack | `--enable-managed-gpu=true` | All four components above: driver, device plugin, DCGM metrics exporter, and GPU health monitoring in NPD. |
| Driver only (default) | `--enable-managed-gpu=false` (or neither flag) | NVIDIA GPU driver only. You install and manage the device plugin, metrics exporter, and health monitoring yourself (for example, with the NVIDIA GPU Operator). |
| None (BYO) | `--enable-managed-gpu=false --gpu-driver None` | Nothing. AKS doesn't install any of the four components. You own the full stack. See Bring your own GPU driver. |

Defaults and overrides

  • Defaults: If you don't pass --enable-managed-gpu or --gpu-driver, AKS applies the Driver only profile to node pools created with an NVIDIA GPU-enabled VM size.
  • Override: managementMode: Managed requires the driver, so --gpu-driver None is ignored when --enable-managed-gpu=true is set, and the driver is still installed. To skip the driver, set both --enable-managed-gpu=false and --gpu-driver None.
  • Immutability: managementMode, migStrategy, and driver are all fixed at creation time. To change the profile, create a new node pool.
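For example, the None (BYO) profile combines both flags at creation time. A minimal sketch; the resource group, cluster, and node pool names are placeholders:

```shell
# Create a GPU node pool where AKS installs none of the four GPU
# components (bring-your-own driver and stack). Names are illustrative.
az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name byognp \
    --node-count 1 \
    --node-vm-size Standard_NC6s_v3 \
    --enable-managed-gpu=false \
    --gpu-driver None
```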

Install the aks-preview CLI extension

  1. Install the aks-preview CLI extension using the az extension add command. Version 19.0.0b29 or later is required.

    az extension add --name aks-preview
    
  2. Update the extension to ensure you have the latest version installed using the az extension update command.

    az extension update --name aks-preview
    

Register the ManagedGPUExperiencePreview feature flag

Register the ManagedGPUExperiencePreview feature flag in your subscription using the az feature register command.

az feature register --namespace Microsoft.ContainerService --name ManagedGPUExperiencePreview
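Registration can take a few minutes to complete. You can check the status with the az feature show command, and then refresh the resource provider registration with the az provider register command:

```shell
# Check registration status; wait until "state" shows "Registered".
az feature show --namespace Microsoft.ContainerService --name ManagedGPUExperiencePreview

# Propagate the change by refreshing the resource provider registration.
az provider register --namespace Microsoft.ContainerService
```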

Limitations

  • This feature currently supports NVIDIA GPU-enabled virtual machine (VM) sizes only.
  • Updating a general-purpose node pool to add a GPU VM size isn't supported on AKS.
  • Windows node pools aren't supported with this feature, because GPU metrics aren't supported. When you create Windows GPU node pools, AKS automatically installs and manages the drivers and DirectX device plugin. For more information, see the AKS Windows GPU documentation.
  • Migrating your existing multi-instance GPU node pools to use this feature isn't supported.
  • In-place upgrades from an existing NVIDIA GPU node pool to a managed GPU node pool aren't supported. To migrate, cordon and drain your existing GPU nodes, then redeploy your workloads to a new GPU node pool created with --enable-managed-gpu=true. For more information, see Resize node pools on AKS.
  • The managementMode, migStrategy, and driver fields under gpuProfile are immutable after node pool creation. To change these values, create a new node pool.
  • Cluster autoscaler isn't supported on managed GPU node pools during preview. Scale these pools manually.

Note

GPU-enabled VMs contain specialized hardware subject to higher pricing and region availability. For more information, see the pricing tool and region availability.

Create an AKS-managed GPU node pool (preview)

Add a managed GPU node pool to an existing AKS cluster by passing --enable-managed-gpu=true to az aks nodepool add. AKS sets gpuProfile.nvidia.managementMode to Managed and installs the GPU driver, device plugin, and DCGM metrics exporter automatically.

To use the default Ubuntu operating system (OS) SKU, create the node pool without specifying an OS SKU. AKS configures the node pool with the default operating system based on the cluster's Kubernetes version.

  1. Add a node pool to your cluster using the az aks nodepool add command with the --enable-managed-gpu=true flag.

    az aks nodepool add \
        --resource-group myResourceGroup \
        --cluster-name myAKSCluster \
        --name gpunp \
        --node-count 1 \
        --node-vm-size Standard_NC6s_v3 \
        --node-taints sku=gpu:NoSchedule \
        --enable-managed-gpu=true
    
  2. Confirm that the managed NVIDIA GPU software components are installed successfully:

    az aks nodepool show \
        --resource-group myResourceGroup \
        --cluster-name myAKSCluster \
        --name gpunp
    

    Your output should include the following values:

    ...
    "gpuProfile": {
        "driver": "Install",
        "driverType": "",
        "nvidia": {
            "managementMode": "Managed",
            "migStrategy": null
        }
    },
    ...
    

Create a managed Multi-Instance GPU (MIG) node pool (preview)

For GPU SKUs that support Multi-Instance GPU (such as A100 and H100), configure a MIG strategy at node pool creation with the --gpu-mig-strategy flag. The strategy controls how MIG partitions are exposed to Kubernetes:

  • Single: All MIG instances are aggregated under the standard nvidia.com/gpu resource.
  • Mixed: Each MIG profile is exposed as a separate resource, such as nvidia.com/mig-1g.10gb.
  • None (default): MIG isn't configured.

The migStrategy field is immutable after the node pool is created.

For background on MIG partitioning, supported VM sizes, and GPU instance profiles, see Create a multi-instance GPU node pool in AKS and NVIDIA Multi-Instance GPU.

az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name mignp \
    --node-count 1 \
    --node-vm-size Standard_NC24ads_A100_v4 \
    --node-taints sku=gpu:NoSchedule \
    --enable-managed-gpu=true \
    --gpu-instance-profile MIG1g \
    --gpu-mig-strategy Single

With this configuration, pods request GPU resources using the standard nvidia.com/gpu resource name.
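With the Mixed strategy, by contrast, pods request a specific MIG profile resource instead. A sketch of such a pod spec, reusing the sample image from this article; the pod name, toleration, and MIG profile are illustrative and must match your node pool's configuration:

```yaml
# Requests one 1g.10gb MIG slice on a node pool created with
# --gpu-mig-strategy Mixed. Pod name and toleration are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: mig-mixed-test
spec:
  restartPolicy: Never
  tolerations:
    - key: "sku"
      operator: "Equal"
      value: "gpu"
      effect: "NoSchedule"
  containers:
    - name: gpu-test
      image: mcr.microsoft.com/azuredocs/samples-tf-mnist-demo:gpu
      command: ["nvidia-smi", "-L"]
      resources:
        limits:
          nvidia.com/mig-1g.10gb: 1
```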

Verify the managed GPU node pool (preview)

After the node pool is ready, run the following checks to confirm that the full managed stack is installed and healthy.

  1. Verify GPU-specific node conditions from Node Problem Detector (NPD):

    GPU_NODE=$(kubectl get nodes -l agentpool=gpunp -o jsonpath='{.items[0].metadata.name}')
    kubectl describe node $GPU_NODE
    

    On a managed GPU node, the following conditions should both report False:

    | Condition | Status | Reason |
    | --- | --- | --- |
    | UnhealthyNvidiaDevicePlugin | False | HealthyNvidiaDevicePlugin |
    | UnhealthyNvidiaDCGMServices | False | HealthyNvidiaDCGMServices |
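    As a scripted alternative to reading the kubectl describe output, you can pull a single condition's status with a JSONPath query (assuming $GPU_NODE is set as above):

    ```shell
    # Prints "False" when the device plugin is healthy.
    kubectl get node "$GPU_NODE" \
      -o jsonpath='{.status.conditions[?(@.type=="UnhealthyNvidiaDevicePlugin")].status}'
    ```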
  2. Verify that the managed GPU label is present on the node:

    kubectl get node $GPU_NODE -o jsonpath='{.metadata.labels.kubernetes\.azure\.com/dcgm-exporter}'
    

    Expected output: enabled.
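    Optionally, you can confirm that DCGM metrics are being served by querying the exporter's endpoint from the node with a debug pod. A sketch under assumptions: the port comes from the components table above, the busybox image is an illustrative choice, and curl is expected to be present on the host (Ubuntu node images ship it):

    ```shell
    # Start a debug pod on the GPU node and query the DCGM exporter
    # from the host via chroot. Image and port are per the assumptions above.
    kubectl debug node/"$GPU_NODE" -it \
      --image=mcr.microsoft.com/cbl-mariner/busybox:2.0 \
      -- chroot /host curl -s localhost:19400/metrics | grep DCGM_FI_DEV_GPU_UTIL
    ```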

  3. Verify that GPU resources are advertised in the node's allocatable resources:

    kubectl get node $GPU_NODE -o jsonpath='{.status.allocatable}'
    

    For a non-MIG node pool, the output includes "nvidia.com/gpu": "1" (or more, depending on the SKU). For a MIG Mixed node pool, the output includes MIG-specific resources such as "nvidia.com/mig-1g.10gb": "7".

  4. Run a sample workload to confirm GPU access from within a container:

    apiVersion: v1
    kind: Pod
    metadata:
      name: managed-gpu-test
    spec:
      restartPolicy: Never
      tolerations:
        - key: "sku"
          operator: "Equal"
          value: "gpu"
          effect: "NoSchedule"
      containers:
      - name: gpu-test
        image: mcr.microsoft.com/azuredocs/samples-tf-mnist-demo:gpu
        command: ["nvidia-smi"]
        resources:
          limits:
            nvidia.com/gpu: 1
    
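    Save the manifest (for example, as managed-gpu-test.yaml; the file name is arbitrary) and apply it, then wait for the pod to finish:

    ```shell
    kubectl apply -f managed-gpu-test.yaml

    # Wait for the pod to complete before reading its logs.
    kubectl wait --for=jsonpath='{.status.phase}'=Succeeded \
      pod/managed-gpu-test --timeout=300s
    ```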

    View the pod logs to see nvidia-smi output showing the GPU device, driver version, and CUDA version:

    kubectl logs managed-gpu-test
    

Scale a managed GPU node pool (preview)

Scale a managed GPU node pool manually with az aks nodepool scale. New nodes install the full managed GPU stack.

az aks nodepool scale \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name gpunp \
    --node-count 2
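To confirm the new node count after the operation completes, you can query the node pool or list its nodes by the agentpool label:

```shell
# Should return 2 once the scale operation finishes.
az aks nodepool show \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name gpunp \
    --query count

# Both nodes should reach the Ready state.
kubectl get nodes -l agentpool=gpunp
```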

Important

During preview, managed GPU node pools don't support the cluster autoscaler. Scale these pools manually.

Alternative install profiles

If the Full managed stack profile isn't the right fit, AKS supports two alternative profiles on GPU node pools.

Driver only (default)

Use this profile when you want AKS to install and maintain the NVIDIA GPU driver, but you plan to deploy the device plugin and metrics exporter yourself (for example, with the NVIDIA GPU Operator). Set --enable-managed-gpu=false:

az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name gpunp \
    --node-count 1 \
    --node-vm-size Standard_NC6s_v3 \
    --node-taints sku=gpu:NoSchedule \
    --enable-managed-gpu=false

With this configuration:

  • AKS installs and manages the NVIDIA GPU driver (gpuProfile.driver is Install).
  • AKS doesn't install a device plugin, DCGM metrics exporter, or GPU health rules. gpuProfile.nvidia is null.
  • No nvidia.com/gpu resource is advertised until you deploy a device plugin.
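As a minimal sketch of the self-managed path, the upstream NVIDIA device plugin can be deployed as a DaemonSet. The version tag and manifest URL below are examples (check the NVIDIA k8s-device-plugin releases for a current one), and the NVIDIA GPU Operator is the more complete alternative:

```shell
# Deploy the upstream NVIDIA device plugin DaemonSet (example version tag).
# Note: the static manifest may need an added toleration to match the
# sku=gpu:NoSchedule taint used in this article.
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/deployments/static/nvidia-device-plugin.yml

# Once the DaemonSet is running, the node should advertise nvidia.com/gpu.
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```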

Next steps