
Node not able to schedule

Manshu 0 Reputation points
2025-09-03T14:03:57.9366667+00:00

Hi Team, we are facing an issue scheduling nodes in node pools in our AKS cluster.

Our goal is to create node pools of NCads H100 v5 series nodes inside our AKS cluster.

  • Subscription ID: ad**********************02
  • Region: US West 2

Thanks, Manshu Sharma

Azure Kubernetes Service

An Azure service that provides serverless Kubernetes, an integrated continuous integration and continuous delivery experience, and enterprise-grade security and governance.


2 answers

  1. Hemalatha 14,525 Reputation points Microsoft External Staff Moderator
    2025-09-07T13:41:51.3133333+00:00

    Hi Manshu,

    Based on the information you provided and known challenges with this scenario, the issue is most likely caused by one of the following factors:

    1. Even though your subscription has sufficient vCPU quota and the SKU is listed as available, the physical capacity in that region may be exhausted at the time of your deployment attempt.

    2. From the output you shared, H100s are available in zones 1 and 3 in westus2. By default, AKS may try to use all available zones for a VMSS-backed node pool. To increase the chances of a successful deployment, explicitly target an availability zone in your command.

    Try running the command by specifying a zone that you know has the SKU:

    # Pick one zone where the SKU is available (1 or 3 in westus2, per the output you shared)
    az aks nodepool add \
    --resource-group voicing-production \
    --cluster-name voicing-aks \
    --name h100pool \
    --node-count 1 \
    --node-vm-size Standard_NC40ads_H100_v5 \
    --os-sku Ubuntu \
    --os-type Linux \
    --node-taints sku=gpu:NoSchedule \
    --zones 1
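
    Before choosing a zone, it may help to re-check which zones the SKU is currently offered in and whether your subscription has any restrictions on it. The following is a minimal sketch, not an official procedure: the function name sku_zones is made up for illustration, and the defaults simply mirror the region and VM size mentioned in this thread. Note that this reports where the SKU is offered, not whether physical capacity is free at that moment.

    ```shell
    # List the zones and restriction reason codes reported for a VM size in a region.
    # $1 = region (default: westus2), $2 = VM size (default: Standard_NC40ads_H100_v5).
    # sku_zones is a hypothetical helper name; the az command itself is standard Azure CLI.
    sku_zones() {
      az vm list-skus \
        --location "${1:-westus2}" \
        --size "${2:-Standard_NC40ads_H100_v5}" \
        --resource-type virtualMachines \
        --all \
        --query "[].{name:name, zones:locationInfo[0].zones, restrictions:restrictions[].reasonCode}" \
        --output json
    }

    # Usage (requires a logged-in Azure CLI):
    #   sku_zones westus2 Standard_NC40ads_H100_v5
    ```

    An empty restrictions list together with zones 1 and 3 would be consistent with the output discussed above; a reason code such as NotAvailableForSubscription would point to a subscription-level restriction rather than a capacity problem.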
    
    3. Finally, the issue might be related to the taint (sku=gpu:NoSchedule) applied to the node pool. The taint ensures that only pods with a matching toleration can be scheduled on these nodes.

    So, verify that tolerations are correctly configured in the pod specification file.
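
    As a rough illustration of a matching toleration, the sketch below prints a minimal Pod manifest that tolerates sku=gpu:NoSchedule. The helper name gpu_pod_manifest, the pod name, and the busybox image are placeholders chosen for this example, and the nodeSelector assumes the node pool carries the kubernetes.azure.com/agentpool label that AKS normally sets. It does not request a GPU resource, so it only tests whether the pod can land on the tainted pool.

    ```shell
    # Print a minimal Pod manifest whose toleration matches the taint sku=gpu:NoSchedule.
    # $1 = node pool name (default: h100pool, as used in the command above).
    # Pod name and image are illustrative placeholders, not taken from the thread.
    gpu_pod_manifest() {
      printf '%s\n' \
        'apiVersion: v1' \
        'kind: Pod' \
        'metadata:' \
        '  name: gpu-toleration-check' \
        'spec:' \
        '  nodeSelector:' \
        "    kubernetes.azure.com/agentpool: ${1:-h100pool}" \
        '  tolerations:' \
        '  - key: "sku"' \
        '    operator: "Equal"' \
        '    value: "gpu"' \
        '    effect: "NoSchedule"' \
        '  containers:' \
        '  - name: main' \
        '    image: busybox:1.36' \
        '    command: ["sleep", "3600"]'
    }

    # Usage (requires kubectl pointed at the cluster):
    #   gpu_pod_manifest h100pool | kubectl apply -f -
    #   kubectl get pod gpu-toleration-check -o wide
    ```

    If this pod stays Pending, `kubectl describe pod gpu-toleration-check` should show in its events whether the cause is an untolerated taint or simply that no node exists in the pool yet.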


    Please see the related official documentation below:

    https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/

    https://learn.microsoft.com/en-us/azure/aks/operator-best-practices-advanced-scheduler#use-taints-and-tolerations

    https://learn.microsoft.com/en-us/azure/aks/quotas-skus-regions

    Hope this helps! Please let me know if you have any queries.



  2. Himanshu Shekhar 6,710 Reputation points Microsoft External Staff Moderator
    2025-09-03T17:20:02.2433333+00:00

    Hello Manshu

    Welcome to Microsoft Q&A Platform. Thank you for reaching out & hope you are doing well.

    Please share a few details so we can investigate further:

    1. Have you raised a quota request for Standard_NCads_H100_v5 in your subscription, specifically in US West 2? Also, which VM size (SKU) are you using?
    2. Is it possible to try alternative regions (e.g., East US, South Central US, or West Europe) to test availability?
    3. Is the AKS cluster configured with Virtual Machine Scale Sets (VMSS) rather than Availability Sets?
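
    The quota and VMSS questions above can likely be answered from the Azure CLI. This is only a sketch under assumptions: the helper names gpu_quota and nodepool_types are invented for illustration, and matching the quota family by the substring "H100" assumes the family's display name contains that text.

    ```shell
    # Show vCPU usage and limit for compute families whose display name contains a string.
    # $1 = region (default: westus2), $2 = substring to match (default: H100, an assumption).
    gpu_quota() {
      az vm list-usage \
        --location "${1:-westus2}" \
        --query "[?contains(localName, '${2:-H100}')].{family:localName, used:currentValue, limit:limit}" \
        --output table
    }

    # Show each node pool's name, type, and zones for a cluster.
    # $1 = resource group, $2 = cluster name.
    # The type column reads VirtualMachineScaleSets for VMSS-backed pools.
    nodepool_types() {
      az aks show \
        --resource-group "$1" \
        --name "$2" \
        --query "agentPoolProfiles[].{name:name, type:type, zones:availabilityZones}" \
        --output table
    }

    # Usage (requires a logged-in Azure CLI):
    #   gpu_quota westus2 H100
    #   nodepool_types voicing-production voicing-aks
    ```

    To try another region, pass it as the first argument to gpu_quota, for example `gpu_quota eastus H100`.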

    Although the vCPU quota for your subscription is sufficient and the requested SKU appears as available, physical capacity for that VM size might be exhausted in the selected region at the time of deployment.

    This can result in allocation failures despite quota availability.

    Please see Microsoft’s documentation on VM allocation failures:

    https://learn.microsoft.com/en-us/troubleshoot/azure/virtual-machines/windows/allocation-failure

    The output you provided indicates that H100s are available only in zones 1 and 3 within the westus2 region.

    By default, AKS attempts to use all available zones for a VMSS-backed node pool, which may lead to failures if the required SKU is not present in every zone.

    To improve the chances of a successful deployment, specify a supported availability zone directly in the command.

    Please see Microsoft’s documentation on configuring AKS node pools and availability zones for best practices:

    https://learn.microsoft.com/en-us/azure/aks/reliability-availability-zones-configure

    https://docs.azure.cn/en-us/aks/reliability-zone-resiliency-recommendations

    Regards

    Himanshu
