Cannot schedule Pods on AKS GPU node

David Giron 41 Reputation points

I have a node pool of Spot GPU nodes (NC4as_T4_v3) with cluster autoscaling set to a range of 0-1 nodes.

After scheduling a Pod with a resource request, the node spawns, but it carries this taint:

The node pool does not launch any new nodes, and my Pod stays in the Pending state.

What is this taint, and how can I prevent it from happening or fix the problem?

Azure Kubernetes Service (AKS)
An Azure service that provides serverless Kubernetes, an integrated continuous integration and continuous delivery experience, and enterprise-grade security and governance.

Accepted answer
  1. srbhatta-MSFT 8,546 Reputation points Microsoft Employee

    Hi @David Giron ,
    Thanks for reaching out to Microsoft QnA.
    This looks like a taint added by the AKS remediator (node auto-repair). Without a look at the logs, it will be difficult for us to confirm or state a reason.
    Have you checked the kubelet or control plane logs, and were you able to find anything there? You can refer to the below links.

    I researched a bit more and checked internally. It seems this is most likely the auto-drain behavior for Spot nodes - Automatically repairing Azure Kubernetes Service (AKS) nodes - Azure Kubernetes Service | Microsoft Learn.
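    To check which taints are actually present on the node and whether auto-repair acted on it, a few kubectl commands can help (`<node-name>` is a placeholder; adjust for your cluster):

    ```shell
    # List every node together with its taint keys (if any)
    kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'

    # Inspect a single node's taints and conditions in detail
    kubectl describe node <node-name>

    # Recent Node events often show why a node was cordoned or drained
    kubectl get events --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp
    ```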

    Hope this helps!


    Please don't forget to Accept the answer and Upvote if you think the information provided was useful, so that it can help others in the community looking for help on similar issues.


4 additional answers

Sort by: Most helpful
  1. David Giron 41 Reputation points

    I couldn't access the kubelet logs since the node was in the NotReady state.
    Your explanation about auto-drain for Spot nodes makes sense, so I assume that's the cause (Spot capacity dropped and the node was preempted/removed).
    Still, I expected the node to return to Ready once capacity was available again; instead, I had to manually scale the node pool to get the node back.
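    For reference, the manual scale-out looks roughly like this with the az CLI (resource group, cluster, and pool names are placeholders; note that a pool with the autoscaler enabled may need it disabled before manual scaling is accepted):

    ```shell
    # Temporarily take the pool off the autoscaler so a manual scale is allowed
    az aks nodepool update --resource-group <my-rg> --cluster-name <my-cluster> \
      --name gpupool --disable-cluster-autoscaler

    # Force the Spot GPU pool back to one node
    az aks nodepool scale --resource-group <my-rg> --cluster-name <my-cluster> \
      --name gpupool --node-count 1

    # Hand the pool back to the autoscaler with the original 0-1 range
    az aks nodepool update --resource-group <my-rg> --cluster-name <my-cluster> \
      --name gpupool --enable-cluster-autoscaler --min-count 0 --max-count 1
    ```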


  2. Niels Claeys 1 Reputation point

    I had exactly the same issue twice on my cluster. No new nodes were provisioned for about a week because the autoscaler believed the node pool was tainted.
    In our case it was fixed by:

    • manually scaling the node pool, as David suggests
    • the autoscaler being restarted during a cluster upgrade

    It seems that when the last node in the node pool is drained and the remediator taint is added, the autoscaler's state for that node pool is never refreshed, so it keeps assuming the node pool cannot be used. Can anyone confirm this?
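    If the stale-state theory is right, cycling the managed autoscaler (short of a full cluster upgrade) might be enough to make it rebuild its view of the pools. A sketch with the az CLI, using placeholder names; on clusters with multiple node pools, `az aks nodepool update` toggles the autoscaler per pool instead:

    ```shell
    # Toggle the cluster autoscaler off and back on to reset its internal state
    az aks update --resource-group <my-rg> --name <my-cluster> --disable-cluster-autoscaler
    az aks update --resource-group <my-rg> --name <my-cluster> \
      --enable-cluster-autoscaler --min-count 0 --max-count 1
    ```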


  3. Michael Taron 1 Reputation point

    We are also hitting this issue: since about noon yesterday, no new GPU nodes were provisioned in a specific node pool. Upgrading the node pool images didn't help (we were already on the latest non-preview Kubernetes version, so we didn't want to upgrade the cluster), but manually scaling the node pool as Niels suggested seems to have done the trick.


  4. Patrick du Boucher-Ryan 1 Reputation point

    Hit this today. Strangely, we could see the scheduler wouldn't place a Pod because of this taint, yet there were no nodes or node pools carrying it:

    1 node(s) had taint { }

    The taint couldn't be found with kubectl, az aks, or in the portal.
