When you run GPU workloads in Azure Kubernetes Service (AKS), you need to install and maintain several software components, including the GPU driver, Kubernetes device plugin, and GPU metrics exporter for telemetry. These components are essential for enabling GPU scheduling, container-level GPU access, observability of resource usage, and proper functioning of AKS GPU-enabled nodes. Previously, cluster operators had to either install these components manually or use open-source alternatives like the NVIDIA GPU Operator, which can introduce complexity and operational overhead.
AKS now supports fully managed GPU nodes (preview) and installs the NVIDIA GPU driver, Kubernetes device plugin, and Data Center GPU Manager (DCGM) metrics exporter by default. This feature enables one-step GPU node pool creation and makes provisioning GPU resources in AKS as simple as provisioning general-purpose CPU nodes.
In this article, you learn how to provision a fully managed GPU node pool (preview) in your AKS cluster, including default installation of the NVIDIA GPU driver, device plugin, and metrics exporter.
Important
AKS preview features are available on a self-service, opt-in basis. Previews are provided "as is" and "as available," and they're excluded from the service-level agreements and limited warranty. AKS previews are partially covered by customer support on a best-effort basis. As such, these features aren't meant for production use. For more information, see the following support articles:
Before you begin
- This article assumes you have an existing AKS cluster. If you don't have a cluster, create one using the Azure CLI, Azure PowerShell, or the Azure portal.
- You need the Azure CLI version 2.72.2 or later installed. To find the version, run `az --version`. If you need to install or upgrade, see Install Azure CLI.
- You need to install and upgrade to the latest version of the `aks-preview` extension.
- You need to register the `ManagedGPUExperiencePreview` feature flag in your subscription.
Limitations
- This feature currently supports NVIDIA GPU-enabled virtual machine (VM) sizes only.
- Updating a general-purpose node pool to add a GPU VM size isn't supported on AKS.
- Windows node pools aren't supported with this feature, because GPU metrics aren't supported on Windows. When you create Windows GPU node pools, AKS automatically installs and manages the drivers and the DirectX device plugin. See AKS Windows GPU documentation for more information.
- Migrating your existing multi-instance GPU node pools to use this feature isn't supported.
- In-place upgrades to use this feature on existing GPU-enabled nodes aren't supported.
Note
GPU-enabled VMs contain specialized hardware subject to higher pricing and region availability. For more information, see the pricing tool and region availability.
Install the aks-preview CLI extension
1. Install the `aks-preview` CLI extension using the `az extension add` command.

   ```azurecli
   az extension add --name aks-preview
   ```

2. Update the extension to ensure you have the latest version installed using the `az extension update` command.

   ```azurecli
   az extension update --name aks-preview
   ```
Register the ManagedGPUExperiencePreview feature flag in your subscription
Register the `ManagedGPUExperiencePreview` feature flag in your subscription using the `az feature register` command.

```azurecli
az feature register --namespace Microsoft.ContainerService --name ManagedGPUExperiencePreview
```
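Feature registration can take a few minutes to complete. As an optional check, you can verify the registration state with the standard `az feature show` command, then refresh the resource provider registration with `az provider register`:

```azurecli
# Check the registration state; wait until it shows "Registered".
az feature show --namespace Microsoft.ContainerService --name ManagedGPUExperiencePreview

# Refresh the Microsoft.ContainerService resource provider registration.
az provider register --namespace Microsoft.ContainerService
```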
Get the credentials for your cluster
Get the credentials for your AKS cluster using the `az aks get-credentials` command.

```azurecli
az aks get-credentials --resource-group $RESOURCE_GROUP --name $CLUSTER_NAME
```
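As an optional sanity check, you can verify the connection to your cluster by listing its nodes:

```bash
# Returns the cluster nodes; confirms kubectl is talking to the right cluster.
kubectl get nodes
```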
Create an AKS-managed GPU node pool (preview)
You can add a fully managed GPU node pool (preview) to an existing AKS cluster by specifying an OS SKU and the `--tags EnableManagedGPUExperience=true` flag. When you do, AKS installs the GPU driver, GPU device plugin, and metrics exporter automatically.

To use the default Ubuntu operating system (OS) SKU, create the node pool without specifying an OS SKU. The node pool is configured with the default operating system based on the Kubernetes version of the cluster.
1. Add a node pool to your cluster using the `az aks nodepool add` command with the `--tags EnableManagedGPUExperience=true` flag.

   ```azurecli
   az aks nodepool add \
       --resource-group MyResourceGroup \
       --cluster-name MyAKSCluster \
       --name gpunp \
       --node-count 1 \
       --node-vm-size Standard_NC6s_v3 \
       --node-taints sku=gpu:NoSchedule \
       --enable-cluster-autoscaler \
       --min-count 1 \
       --max-count 3 \
       --tags EnableManagedGPUExperience=true
   ```

2. Confirm that the managed NVIDIA GPU software components are installed successfully using the `az aks nodepool show` command.

   ```azurecli
   az aks nodepool show \
       --resource-group MyResourceGroup \
       --cluster-name MyAKSCluster \
       --name gpunp
   ```

   Your output should include the following values:

   ```output
   ...
   "gpuInstanceProfile": …,
   "gpuProfile": {
     "driver": "Install"
   },
   ...
   ```
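Because the node pool in the previous step is created with the `sku=gpu:NoSchedule` taint, workloads need a matching toleration and a GPU resource request to schedule onto the new nodes. The following manifest is a minimal sketch, not from this article: the pod name and container image are illustrative, and it requests one GPU through the standard `nvidia.com/gpu` resource exposed by the NVIDIA device plugin.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test   # hypothetical name, for illustration only
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative public CUDA image
    command: ["nvidia-smi"]                      # prints visible GPUs, then exits
    resources:
      limits:
        nvidia.com/gpu: 1   # request one GPU from the NVIDIA device plugin
  tolerations:
  - key: "sku"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"    # matches the taint set on the node pool above
```

You can apply the manifest with `kubectl apply -f <file>` and check the result with `kubectl logs gpu-smoke-test`.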
Migrate existing GPU workloads to an AKS-managed GPU node pool
In-place upgrades from a standard NVIDIA GPU node pool to a fully managed NVIDIA GPU node pool (preview) on your AKS cluster aren't supported. We recommend cordoning and draining your existing GPU nodes, then redeploying your workloads to a new GPU-enabled node pool with this feature enabled, as shown in the sketch below. See Resize node pools on AKS to learn more.
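As a minimal sketch of this approach, assuming a hypothetical existing node named `aks-gpunp-12345678-vmss000000`:

```bash
# Mark the existing GPU node as unschedulable so no new pods are placed on it.
kubectl cordon aks-gpunp-12345678-vmss000000

# Evict running pods so they reschedule onto the new managed GPU node pool.
kubectl drain aks-gpunp-12345678-vmss000000 --ignore-daemonsets --delete-emptydir-data
```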
Bring your own (BYO) GPU driver
If you want to control the installation of the NVIDIA drivers or use the NVIDIA GPU Operator, you can bypass the GPU driver installation during node pool creation. In this case, Microsoft doesn't support or manage the maintenance and compatibility of the NVIDIA drivers as part of the node image deployment. See Skip GPU driver installation for NVIDIA GPU-enabled nodes on AKS to learn more.
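For example, a minimal sketch of creating a node pool without the managed driver, assuming a `--gpu-driver none` parameter (the exact parameter name is an assumption; confirm it in the linked article):

```azurecli
# Create a GPU node pool and skip the managed NVIDIA driver installation,
# so you can install your own driver or the NVIDIA GPU Operator afterwards.
# The --gpu-driver parameter is an assumption; see the linked article.
az aks nodepool add \
    --resource-group MyResourceGroup \
    --cluster-name MyAKSCluster \
    --name gpubyo \
    --node-count 1 \
    --node-vm-size Standard_NC6s_v3 \
    --gpu-driver none
```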
Next steps
- Deploy a sample GPU workload on your AKS-managed GPU-enabled nodes.
- Learn about GPU utilization and performance metrics from managed NVIDIA DCGM exporter on your GPU node pool.
Related articles
- Learn about GPU health monitoring with Node Problem Detector (NPD) on AKS.
- Run distributed inference on multiple AKS GPU nodes.