Azure AKS Multi Instance GPU NodePool Failure

Bogdan Serban 1 Reputation point
2022-10-24T16:33:12.607+00:00

Hello,

I am trying to set up Multi-Instance GPU (MIG) on my Azure Kubernetes Service (AKS) cluster, as described here: https://learn.microsoft.com/en-us/azure/aks/gpu-multi-instance

The node pool I am trying this on uses the Standard_NC24ads_A100_v4 VM size.
When I try to create the node pool with the desired gpu-instance-profile option, the operation runs for about 27 minutes and then fails with the following error:

{
  "name": "e4cb40c0-dff3-8a43-a400-1cc3e37ed80b",
  "status": "Failed",
  "startTime": "2022-10-24T15:53:37.6075844Z",
  "endTime": "2022-10-24T16:20:27.106602Z",
  "error": {
    "code": "ReconcileVMSSAgentPoolFailed",
    "message": "We are unable to serve this request due to an internal error, Correlation ID: 42c71e80-8830-4b67-8155-b957d79e5003, Operation ID: c040cbe4-f3df-438a-a400-1cc3e37ed80b, Timestamp: 2022-10-24T16:20:27Z."
  }
}

The command I am running is the following:

az aks nodepool add \
    --name imqpoolmig \
    --resource-group <resource-group> \
    --cluster-name <cluster-name> \
    --node-vm-size Standard_NC24ads_A100_v4 \
    --node-taints app=imq:PreferNoSchedule \
    --labels app=imq \
    --node-count 1 \
    --gpu-instance-profile mig1g \
    --debug

I have attempted to run the command with all the supported options for gpu-instance-profile and got the same result.
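
Trying every profile in turn can be sketched as a small loop. This is only a minimal sketch under assumptions: the profile names (MIG1g through MIG7g) are taken from the documentation linked above, the helper name try_mig_profiles is hypothetical, and the taint and label flags from the original command are left out for brevity. By default it only prints the commands (dry run); each real run would try to create a billable A100 node pool.

```shell
#!/usr/bin/env bash
# Assumed profile names, as listed in the AKS MIG documentation.
MIG_PROFILES="MIG1g MIG2g MIG3g MIG4g MIG7g"

# Hypothetical helper: build one `az aks nodepool add` call per MIG profile.
# $1 = resource group, $2 = cluster name,
# $3 = runner prefix; defaults to "echo" so the commands are only printed.
try_mig_profiles() {
  local rg="$1" cluster="$2" run="${3:-echo}"
  local profile pool
  for profile in $MIG_PROFILES; do
    # Linux node pool names must be short, lowercase alphanumerics (e.g. "mig1g").
    pool="$(printf '%s' "$profile" | tr '[:upper:]' '[:lower:]')"
    $run az aks nodepool add \
      --resource-group "$rg" \
      --cluster-name "$cluster" \
      --name "$pool" \
      --node-vm-size Standard_NC24ads_A100_v4 \
      --node-count 1 \
      --gpu-instance-profile "$profile"
  done
}
```

Calling `try_mig_profiles <resource-group> <cluster-name>` prints the five commands for review; passing `command` as a third argument would execute them instead of printing.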

The Kubernetes version is 1.24.3.

EDIT:
My requirements are the following:

  1. Using the Standard_NC24ads_A100_v4 node size.
  2. Being able to set the Multi-Instance GPU profile using the gpu-instance-profile argument.

Note that if I omit the gpu-instance-profile argument, the command succeeds, and I am able to create and scale up my node pool.
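
To compare a pool created without the argument against a failing one, the provisioning state and the MIG profile recorded on each pool can be read back with `az aks nodepool show`. The following is a minimal sketch, not a confirmed diagnostic procedure: the helper name show_pool_status is hypothetical, and the queried field names (provisioningState, gpuInstanceProfile) are assumed from the AKS agent pool resource schema.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the provisioning state and MIG profile of one pool.
# $1 = resource group, $2 = cluster name, $3 = node pool name.
show_pool_status() {
  # The JMESPath query keeps only the two fields of interest
  # (field names are an assumption, see the note above).
  az aks nodepool show \
    --resource-group "$1" \
    --cluster-name "$2" \
    --name "$3" \
    --query '{state: provisioningState, mig: gpuInstanceProfile}' \
    --output json
}
```

For example, `show_pool_status <resource-group> <cluster-name> imqpoolmig` would report the state of the pool from the failing command above.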

Azure Kubernetes Service (AKS)

1 answer

  1. SUNOJ KUMAR YELURU 14,021 Reputation points MVP
    2022-10-25T01:40:26.833+00:00

    Hi @Bogdan Serban

    Thanks for reaching out on the Q&A forum.

    Add a node pool for GPU nodes

    Try using the command below:

    az aks nodepool add \
        --resource-group myResourceGroup \
        --cluster-name myAKSCluster \
        --name gpunp \
        --node-count 1 \
        --node-vm-size Standard_NC6 \
        --node-taints sku=gpu:NoSchedule \
        --aks-custom-headers UseGPUDedicatedVHD=true \
        --enable-cluster-autoscaler \
        --min-count 1 \
        --max-count 3
    

    ----
    If this answers your question, please click Accept Answer and up-vote it. If you have any further questions, do let us know.