How to create a Linux GPU node pool in AKS with an OS SKU and use UseGPUDedicatedVHD to install NVIDIA drivers

Manjunath Gurik 0 Reputation points
2023-06-08T22:40:15.8366667+00:00

Hi,

We are creating a GPU node pool in AKS using the first approach from this link:

https://learn.microsoft.com/en-us/azure/aks/gpu-cluster

az aks nodepool add \
   --resource-group my-scus-rg-app \
   --cluster-name k8s-dev-aks \
   --name gpunp2 \
   --node-count 1 \
   --node-vm-size Standard_NC6s_v3 \
   --node-taints sku=gpu:NoSchedule \
   --aks-custom-headers UseGPUDedicatedVHD=true \
   --labels algo=qiefp-linux-gpu-nc6s-v3 \
   --enable-cluster-autoscaler \
   --min-count 0 \
   --max-count 1

This creates a Linux GPU node with OS image version AKSUbuntu-1804gen2gpucontainerd-202304.10.0.

However, we want to use the latest Ubuntu version (Ubuntu-2204).

How can we create a GPU node pool with the latest Ubuntu version while still using the UseGPUDedicatedVHD=true preview image to install the NVIDIA driver?
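
To confirm which image a pool actually received, one option is to query the pool's node image version and read the Ubuntu release out of the name. The following is a minimal sketch, not an official tool: the function names `ubuntu_release_from_image` and `nodepool_image_version` are hypothetical, and it assumes the image name follows the `AKSUbuntu-<release>...` pattern seen above.

```shell
# Extract the Ubuntu release digits (e.g. 1804 or 2204) from an AKS node
# image version string. Prints nothing if the name does not start with
# "AKSUbuntu-" followed by digits (assumed naming pattern).
ubuntu_release_from_image() {
  printf '%s\n' "$1" | sed -n 's/^AKSUbuntu-\([0-9][0-9]*\).*/\1/p'
}

# Hypothetical helper: print the node image version of a node pool.
# Arguments: resource group, cluster name, node pool name.
# Requires a logged-in Azure CLI.
nodepool_image_version() {
  az aks nodepool show \
    --resource-group "$1" \
    --cluster-name "$2" \
    --name "$3" \
    --query nodeImageVersion \
    --output tsv
}
```

For the image named above, `ubuntu_release_from_image 'AKSUbuntu-1804gen2gpucontainerd-202304.10.0'` prints `1804`.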

Azure Kubernetes Service

5 answers

  1. El Mehdi Iddouch 0 Reputation points
    2023-06-08T22:50:43.65+00:00

    If you want to create a GPU node pool in AKS with the latest Ubuntu version (Ubuntu-2204) and the preview image with UseGPUDedicatedVHD=true to install the NVIDIA driver, you can follow these steps:

    First:

    Make sure you have the latest Azure CLI version installed.

    Second:

    Run the following command to create the GPU node pool:

    az aks nodepool add \
       --resource-group <resource-group-name> \
       --cluster-name <cluster-name> \
       --name gpunp2 \
       --node-count 1 \
       --node-vm-size Standard_NC6s_v3 \
       --node-taints sku=gpu:NoSchedule \
       --labels algo=qiefp-linux-gpu-nc6s-v3 \
       --enable-cluster-autoscaler \
       --min-count 0 \
       --max-count 1 \
       --os-type Linux \
       --aks-custom-headers UseGPUDedicatedVHD=true \
       --image-reference publisher=Canonical,offer=0001-com-ubuntu-server-focal,sku=20_04-lts-gen2,p3=VHD

    Make sure to replace <resource-group-name> with the name of your resource group and <cluster-name> with the name of your AKS cluster.

    Explanation of the command:

    --os-type Linux: Specifies that the OS type for the node pool is Linux.

    --aks-custom-headers UseGPUDedicatedVHD=true: Uses the preview image with UseGPUDedicatedVHD=true to install the NVIDIA driver.

    --image-reference publisher=Canonical,offer=0001-com-ubuntu-server-focal,sku=20_04-lts-gen2,p3=VHD: Specifies the image reference for the Ubuntu-2204 image. This reference corresponds to the latest Ubuntu version with the GPU-specific VHD image.


  2. Manjunath Gurik 0 Reputation points
    2023-06-09T17:08:03.89+00:00

    Hi,

    Thank you for the suggestion. When I tried to run this command, I got the error below.

    unrecognized arguments: --image-reference publisher=Canonical,offer=0001-com-ubuntu-server-focal,sku=22_04-lts-gen2,p3=VHD

    My az CLI version is 2.49, which is the latest version.
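
An "unrecognized arguments" error indicates that the installed CLI command does not define that flag. A quick way to check before running the command is to search its help text. A small sketch follows; `cli_help_has_flag` is a hypothetical helper name, and it only does a plain substring match on whatever help text is piped in.

```shell
# Succeeds (exit status 0) if the help text on stdin mentions the given flag.
# Plain fixed-string match, so a longer flag with the same prefix also matches.
cli_help_has_flag() {
  grep -qF -e "$1"
}

# Example usage (requires the Azure CLI to be installed):
#   az aks nodepool add --help | cli_help_has_flag --image-reference \
#     && echo "supported" || echo "not supported"
```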


  3. El Mehdi Iddouch 0 Reputation points
    2023-06-09T20:25:44.72+00:00

    The NVIDIA driver installation is typically handled by the NVIDIA GPU Operator or other custom configurations.

    To create a GPU node pool with the latest Ubuntu version without the UseGPUDedicatedVHD=true preview image, you can use the following command:

    az aks nodepool add \
       --resource-group <resource-group-name> \
       --cluster-name <cluster-name> \
       --name gpunp2 \
       --node-count 1 \
       --node-vm-size Standard_NC6s_v3 \
       --node-taints sku=gpu:NoSchedule \
       --labels algo=qiefp-linux-gpu-nc6s-v3 \
       --enable-cluster-autoscaler \
       --min-count 0 \
       --max-count 1 \
       --os-disk-type Ephemeral \
       --os-type Linux
    

    Replace <resource-group-name> with the name of your resource group and <cluster-name> with the name of your AKS cluster.

    Again, I apologize for the confusion caused. This command will create a GPU node pool with the latest Ubuntu version using the --os-disk-type Ephemeral option.

    
    

  4. Manjunath Gurik 0 Reputation points
    2023-06-09T20:35:39.7433333+00:00

    Hi,

    Yes, we are already using this approach to create the GPU node pool with the latest Ubuntu. We also need the NVIDIA driver, so we followed the "Manually install the NVIDIA device plugin" steps from:

    https://learn.microsoft.com/en-us/azure/aks/gpu-cluster

    But with this approach the node takes more than 5 minutes to come up, and we frequently get this error:

    Error: <class 'cupy_backends.cuda.api.runtime.CUDARuntimeError'>

    CUDARuntimeError('cudaErrorNoDevice: no CUDA-capable device is detected')

    Using the preview image is good and fast.

    Is there any other approach to create a node pool from the preview image with the latest Ubuntu?
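
When debugging the cudaErrorNoDevice error, it may help to check whether the node advertises the nvidia.com/gpu resource at all, which the AKS GPU article does with kubectl describe node. The sketch below only parses that output; `gpu_count_from_describe` is a hypothetical helper name, and it assumes the resource appears on a line of the form `nvidia.com/gpu:  <count>`.

```shell
# Print the first nvidia.com/gpu count found in `kubectl describe node`
# output read from stdin, or 0 if the resource is not listed.
gpu_count_from_describe() {
  awk '$1 == "nvidia.com/gpu:" { print $2; found = 1; exit }
       END { if (!found) print 0 }'
}

# Example usage (requires kubectl access to the cluster;
# <node-name> is a placeholder):
#   kubectl describe node <node-name> | gpu_count_from_describe
```

A result of 0 would suggest the device plugin has not registered the GPU yet, so pods scheduled at that moment may not see a CUDA device.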


  5. Sam Rui 0 Reputation points
    2023-09-05T09:47:22.3066667+00:00

    Hi,

    I have the same problem.

    When I create a node pool with UseGPUDedicatedVHD=true, the image is "AKSUbuntu-1804gen2gpucontainerd-202308.10.0" and cannot be changed to the Ubuntu-2204 version.

