Quota limits put AKS cluster into failed state

Luke Briner 0 Reputation points
2023-02-24T15:48:38.41+00:00

I wanted to upgrade the Kubernetes version on a node pool. When I clicked to upgrade, an error said that I had reached my quota. That was correct, but the upgrade action has put my cluster into a failed state and I cannot retry the upgrade. It's been around an hour and the cluster hasn't recovered, so I am stuck waiting.

This should be simple to fix: either pre-check the quota, display a message such as "Cannot upgrade cluster until quota is raised", and don't put the cluster into an errored state, or accept that upgrading a cluster temporarily requires additional resources and allow the upgrade, since the cluster will be back within limits afterwards.

Azure Kubernetes Service (AKS)
An Azure service that provides serverless Kubernetes, an integrated continuous integration and continuous delivery experience, and enterprise-grade security and governance.

3 answers

  1. Sam Cogan 10,082 Reputation points MVP
    2023-02-24T16:42:11.7066667+00:00

    To get your cluster out of a failed state you can run an upgrade with the same version it is currently on, using the CLI. So if you are currently on 1.25.5 you would run:

    az aks upgrade --resource-group myResourceGroup --name myAKSCluster --kubernetes-version 1.25.5
    

    This should get your cluster out of the failed state. To upgrade to the version you require, you will need to raise your quota.
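
    To confirm that the cluster has left the failed state, it may help to read its provisioning state before and after running the command. A minimal sketch, reusing the placeholder names myResourceGroup and myAKSCluster; the helper name aks_state is made up for illustration:

    ```shell
    # Hypothetical helper: print the provisioning state of an AKS cluster
    # (for example "Succeeded" or "Failed").
    aks_state() {
      az aks show --resource-group "$1" --name "$2" \
        --query provisioningState --output tsv
    }

    # Usage (needs the Azure CLI and a logged-in session):
    #   aks_state myResourceGroup myAKSCluster
    ```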

    1 person found this answer helpful.

  2. KarishmaTiwari-MSFT 18,337 Reputation points Microsoft Employee
    2023-02-27T08:41:07.58+00:00

    Hi Luke Briner, thanks for posting your query on Microsoft Q&A.
    I'm sorry to hear that you're having trouble upgrading your AKS cluster. I understand your concern about the error message: the quota error should be shown before the cluster is put into a failed state. I will share your feedback with the product team internally.
    Why could it have failed?

    An AKS cluster upgrade triggers a cordon and drain of your nodes. If you have a low compute quota available, the upgrade may fail.
    According to the documentation, by default AKS configures upgrades to surge with one extra node. **Node surges require subscription quota for the requested max surge count for each upgrade operation.**
    For example, a cluster that has 5 node pools, each with a count of 4 nodes, has a total of 20 nodes. If each node pool has a max surge value of 50%, additional compute and IP quota of 10 nodes (2 nodes * 5 pools) is required to complete the upgrade.
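
    The arithmetic in that example can be checked with a few lines of shell. This is only a sketch: the function name surge_nodes is made up, and rounding fractional nodes up is an assumption about how percentage surge values are treated, so check it against the current AKS documentation.

    ```shell
    # Hypothetical helper: extra nodes one pool needs for a percentage max surge.
    # Assumption: fractional nodes are rounded up.
    surge_nodes() {
      node_count=$1
      surge_percent=$2
      echo $(( (node_count * surge_percent + 99) / 100 ))
    }

    pools=5
    per_pool=$(surge_nodes 4 50)   # 4 nodes at 50% -> 2 extra nodes
    echo "Extra nodes per pool: $per_pool"
    echo "Extra nodes in total: $(( per_pool * pools ))"   # 2 * 5 = 10
    ```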

    If you're using Azure CNI, you also need to validate that there are available IPs in the subnet to satisfy the IP requirements of Azure CNI.

    The default max surge value is one, which minimizes workload disruption by creating an extra node before an existing node is cordoned and drained, so workloads have somewhere to move while the older-versioned node is replaced. The max surge value can be customized per node pool to trade off upgrade speed against upgrade disruption. Increasing it makes the upgrade complete faster, but a large value may cause disruptions during the upgrade. For example, a max surge value of 100% provides the fastest possible upgrade (doubling the node count) but also causes all nodes in the node pool to be drained simultaneously. For production node pools, a max surge setting of 33% is recommended.
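
    As a rough illustration, the max surge value of a node pool can be changed with the Azure CLI. The sketch below wraps the command in a made-up helper, set_max_surge; the node pool name mynodepool is a placeholder, and a higher surge value also means more quota is needed during the upgrade.

    ```shell
    # Hypothetical helper: set the max surge value on one node pool.
    # Arguments: resource group, cluster name, node pool name, surge value.
    set_max_surge() {
      az aks nodepool update --resource-group "$1" --cluster-name "$2" \
        --name "$3" --max-surge "$4"
    }

    # Usage (needs the Azure CLI and a logged-in session):
    #   set_max_surge myResourceGroup myAKSCluster mynodepool 33%
    ```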
    How to solve your current issue?

    1. As shared in the answer provided by Sam, please try the steps to get the cluster out of failed state and let me know if it is still failing. I can further troubleshoot and engage the right teams as needed.
    2. To raise the limit or quota for your subscription, go to the Azure portal and file a Service and subscription limits (quotas) support ticket. In this case, you have to submit a support ticket to increase the quota for compute cores.
      Once that is completed, you can try the upgrade again.
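
    Before retrying, it may be worth comparing current vCPU usage with the limit for your region. `az vm list-usage` lists both per VM family. The sketch below is an illustration only: the helper names and the numbers are made up, and you would read the real values from the command's output.

    ```shell
    # List compute usage and limits for a region, for example "eastus".
    show_compute_usage() {
      az vm list-usage --location "$1" --output table
    }

    # Hypothetical check: succeeds when current usage plus the extra vCPUs
    # needed for surge nodes still fits under the limit.
    fits_quota() {
      current=$1
      limit=$2
      extra=$3
      [ $(( current + extra )) -le "$limit" ]
    }

    # Made-up numbers: 18 of 20 vCPUs used, one 4-vCPU surge node needed.
    if fits_quota 18 20 4; then
      echo "Enough quota for the surge node"
    else
      echo "Quota increase needed before upgrading"
    fi
    ```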

    If you have any questions, please let me know in the "comments."


  3. Eddie Neto 1,236 Reputation points Microsoft Employee
    2023-02-27T09:16:26.1433333+00:00

    @Luke Briner

    Thanks for reaching out on Microsoft Q&A.

    Regarding the quota issue: could you please check whether your subscription has quota configured for the machine SKU you are using?

    If you don't have quota for that SKU in your region, you must request it, and then you will be able to upgrade or create with the required size. The picture below shows where to check whether machines with the required quota are available on your subscription.

    Search for the machine size that you are looking for.

    Hope this helps. Please "Accept as Answer" if it helped, so that it can help others in the community looking for help on similar topics.

    User's image
