Hi Manshu,
Based on the information you provided and known challenges with this scenario, the issue is most likely caused by the following factors:
1. Even though your subscription has sufficient vCPU quota and the SKU is listed as available, the physical capacity in that region may be exhausted at the time of your deployment attempt.
2. The output you shared shows that H100s are available in zones 1 and 3 in westus2. By default, AKS may try to use all available zones for a VMSS-backed node pool. To increase the chances of a successful deployment, explicitly target an availability zone in your command.
Try running the command with a zone that you know has the SKU (1 or 3 in your case; the example below uses zone 1):
az aks nodepool add \
--resource-group voicing-production \
--cluster-name voicing-aks \
--name h100pool \
--node-count 1 \
--node-vm-size Standard_NC40ads_H100_v5 \
--os-sku Ubuntu \
--os-type Linux \
--node-taints sku=gpu:NoSchedule \
--zones 1
3. Finally, the issue might be related to the taint (sku=gpu:NoSchedule) you applied to the node pool. This taint ensures that only pods with a matching toleration can be scheduled on these nodes.
So, verify that the toleration is correctly configured in your pod specification file.
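For reference, here is a minimal sketch of a pod spec with a toleration matching the sku=gpu:NoSchedule taint. The file name, pod name, and container image are placeholders I chose for illustration, and the nvidia.com/gpu limit assumes the NVIDIA device plugin is running in your cluster, so adjust these to your workload:

```shell
# Write a sample pod manifest whose toleration matches the node pool taint
# sku=gpu:NoSchedule. All names below are illustrative placeholders.
cat > gpu-toleration-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test-pod            # hypothetical pod name
spec:
  containers:
  - name: gpu-test
    image: mcr.microsoft.com/azuredocs/samples-tf-mnist-demo:gpu   # placeholder image, replace with your own
    resources:
      limits:
        nvidia.com/gpu: 1       # assumes the NVIDIA device plugin is installed
  tolerations:
  - key: "sku"                  # must match the taint key
    operator: "Equal"
    value: "gpu"                # must match the taint value
    effect: "NoSchedule"        # must match the taint effect
EOF
```

You can then apply it with kubectl apply -f gpu-toleration-pod.yaml and confirm with kubectl describe pod gpu-test-pod that the pod is scheduled on the H100 node. The key, value, and effect in the toleration must all match the taint exactly, otherwise the pod stays in Pending.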
Please find the related official documentation below:
https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
https://learn.microsoft.com/en-us/azure/aks/quotas-skus-regions
Hope this helps! Please let me know if you have any further questions.