Azure AKS Multi Instance GPU NodePool Failure

Question

Hello,

I am trying to set up Multi-Instance GPU (MIG) as described here: https://learn.microsoft.com/en-us/azure/aks/gpu-multi-instance on my Azure Kubernetes Cluster.

The nodepool that I am trying this on uses the Standard_NC24ads_A100_v4 instance size.
When I try to create the nodepool with the desired gpu-instance-profile option, the operation runs for about 27 minutes and then fails with the following error:

{
  "name": "e4cb40c0-dff3-8a43-a400-1cc3e37ed80b",
  "status": "Failed",
  "startTime": "2022-10-24T15:53:37.6075844Z",
  "endTime": "2022-10-24T16:20:27.106602Z",
  "error": {
    "code": "ReconcileVMSSAgentPoolFailed",
    "message": "We are unable to serve this request due to an internal error, Correlation ID: 42c71e80-8830-4b67-8155-b957d79e5003, Operation ID: c040cbe4-f3df-438a-a400-1cc3e37ed80b, Timestamp: 2022-10-24T16:20:27Z."
  }
}

The command I am running is the following:

az aks nodepool add --name imqpoolmig --resource-group  --cluster-name  --node-vm-size Standard_NC24ads_A100_v4 --node-taints app=imq:PreferNoSchedule --labels app=imq --node-count 1 --gpu-instance-profile mig1g --debug

I have attempted to run the command with all the supported options for gpu-instance-profile and got the same result.

The Kubernetes version is 1.24.3.

EDIT:
My requirements are the following:

Use the Standard_NC24ads_A100_v4 node size.
Set the Multi-Instance GPU profile using the gpu-instance-profile argument.

I want to mention that if I do not add the gpu-instance-profile argument, the command executes correctly, and I am able to create and scale up my nodepool.

Answer

Hi @Bogdan Serban

Thanks for reaching out on the Q&A forum.

Add a node pool for GPU nodes

Try using the command below:

az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name gpunp \
    --node-count 1 \
    --node-vm-size Standard_NC6 \
    --node-taints sku=gpu:NoSchedule \
    --aks-custom-headers UseGPUDedicatedVHD=true \
    --enable-cluster-autoscaler \
    --min-count 1 \
    --max-count 3
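
The command above creates a general GPU node pool and does not set a MIG profile. As a rough sketch of how a MIG variant might be assembled for the VM size from the question, the helper below only prints the command so it can be reviewed before running. The function name build_mig_nodepool_cmd and the resource names myResourceGroup and myAKSCluster are placeholders, not values from this thread. The profile spellings (MIG1g, MIG2g, MIG3g, MIG4g, MIG7g) are taken from the AKS multi-instance GPU guide linked in the question; whether the lowercase mig1g spelling behaves differently is not established here, and this sketch is not confirmed to resolve the ReconcileVMSSAgentPoolFailed error.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints (does not execute) an az command for a MIG node pool.
# Resource group and cluster names are placeholders; replace them before use.
build_mig_nodepool_cmd() {
  local profile="$1"
  # Accept only the profile spellings listed in the AKS MIG guide (assumption).
  case "$profile" in
    MIG1g|MIG2g|MIG3g|MIG4g|MIG7g) ;;
    *) echo "unsupported profile: $profile" >&2; return 1 ;;
  esac
  printf '%s\n' "az aks nodepool add --resource-group myResourceGroup --cluster-name myAKSCluster --name imqpoolmig --node-count 1 --node-vm-size Standard_NC24ads_A100_v4 --gpu-instance-profile $profile"
}

# Print the command for the smallest profile so it can be inspected first.
build_mig_nodepool_cmd MIG1g
```

Printing the command rather than running it keeps the sketch safe to try on a machine without Azure credentials; once the output looks right, it can be pasted into a shell that is logged in with az.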

----
If this answers your question, please click Accept Answer and Up-Vote. If you have any further questions, let us know.
