Hello,
I am trying to set up Multi-Instance GPU (MIG) on my Azure Kubernetes Service (AKS) cluster, as described here: https://learn.microsoft.com/en-us/azure/aks/gpu-multi-instance
The node pool I am trying this on uses the Standard_NC24ads_A100_v4 VM size.
When I try to create the node pool with the --gpu-instance-profile option, the operation runs for about 27 minutes (see the timestamps below) and then fails with the following error:
{
  "name": "e4cb40c0-dff3-8a43-a400-1cc3e37ed80b",
  "status": "Failed",
  "startTime": "2022-10-24T15:53:37.6075844Z",
  "endTime": "2022-10-24T16:20:27.106602Z",
  "error": {
    "code": "ReconcileVMSSAgentPoolFailed",
    "message": "We are unable to serve this request due to an internal error, Correlation ID: 42c71e80-8830-4b67-8155-b957d79e5003, Operation ID: c040cbe4-f3df-438a-a400-1cc3e37ed80b, Timestamp: 2022-10-24T16:20:27Z."
  }
}
The command I am running is the following:
az aks nodepool add --name imqpoolmig --resource-group <resource-group> --cluster-name <cluster-name> --node-vm-size Standard_NC24ads_A100_v4 --node-taints app=imq:PreferNoSchedule --labels app=imq --node-count 1 --gpu-instance-profile mig1g --debug
I have run the command with every supported value of --gpu-instance-profile and got the same error each time.
The Kubernetes version is 1.24.3.
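For anyone who wants to reproduce this, here is a minimal sketch of trying every profile in turn. It is a dry run that only prints each command rather than executing it. The profile names (MIG1g, MIG2g, MIG3g, MIG4g, MIG7g) are an assumption taken from the linked documentation, and RESOURCE_GROUP, CLUSTER_NAME and the nodepool_cmd helper are placeholders introduced for illustration.

```shell
#!/usr/bin/env bash
# Dry-run sketch: print one `az aks nodepool add` command per MIG profile.
# Nothing is sent to Azure; copy a printed line to actually run it.

# Placeholders: set these in the environment to match your cluster.
RESOURCE_GROUP="${RESOURCE_GROUP:-<resource-group>}"
CLUSTER_NAME="${CLUSTER_NAME:-<cluster-name>}"

# Assumed list of profile names, as given in the linked documentation.
PROFILES="MIG1g MIG2g MIG3g MIG4g MIG7g"

# Print the full command line for a single profile on one line.
nodepool_cmd() {
  profile="$1"
  echo az aks nodepool add \
    --name imqpoolmig \
    --resource-group "$RESOURCE_GROUP" \
    --cluster-name "$CLUSTER_NAME" \
    --node-vm-size Standard_NC24ads_A100_v4 \
    --node-taints app=imq:PreferNoSchedule \
    --labels app=imq \
    --node-count 1 \
    --gpu-instance-profile "$profile"
}

for p in $PROFILES; do
  nodepool_cmd "$p"
done
```

Each printed line differs only in the value passed to --gpu-instance-profile. Because the pool name is the same in every command, a pool left over from an earlier attempt would presumably need to be deleted before the next one is tried.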
EDIT:
My requirements are:
- Using the Standard_NC24ads_A100_v4 node size.
- Being able to set the Multi-Instance GPU profile using the --gpu-instance-profile argument.
Note that if I omit the --gpu-instance-profile argument, the command succeeds, and I am able to create and scale up my node pool.