Hi Luke Briner, thanks for posting your query on Microsoft Q&A.
I'm sorry to hear that you're having trouble upgrading your AKS cluster. I understand your concern about the error message: it should be displayed before the cluster is put into a failed state. I will share your feedback with the product team internally.
Why could it have failed?
An AKS cluster upgrade triggers a cordon and drain of your nodes. If you have a low compute quota available, the upgrade may fail.
According to the documentation, by default AKS configures upgrades to surge with one extra node. **Node surges require subscription quota for the requested max surge count for each upgrade operation.**
For example, a cluster that has 5 node pools, each with a count of 4 nodes, has a total of 20 nodes. If each node pool has a max surge value of 50%, additional compute and IP quota of 10 nodes (2 nodes * 5 pools) is required to complete the upgrade.
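The arithmetic in that example can be sketched in a few lines of Python. This is only an illustration: `surge_nodes` is a hypothetical helper, not an Azure SDK call, and it assumes that a percentage max surge is applied to the current node count of the pool and that fractional nodes are rounded up.

```python
def surge_nodes(node_count: int, max_surge: str) -> int:
    """Extra nodes one node pool needs during an upgrade (hypothetical helper).

    max_surge is either an integer string such as "1" or a percentage such as "50%".
    """
    if max_surge.endswith("%"):
        percent = int(max_surge[:-1])
        # Assumption: fractional nodes are rounded up (integer ceiling division).
        return (node_count * percent + 99) // 100
    return int(max_surge)


# The example above: 5 node pools with 4 nodes each, max surge of 50% on every pool.
pools = [4] * 5
total_surge = sum(surge_nodes(count, "50%") for count in pools)
print(total_surge)  # 10 extra nodes of compute and IP quota
```

Multiplying the result by the core count of your VM size gives a rough figure to compare against the remaining compute quota in your subscription.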
If you're using Azure CNI, you also need to validate that there are available IPs in the subnet to satisfy the IP requirements of Azure CNI.
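As a rough way to reason about the subnet check, here is a minimal Python sketch. It assumes the traditional Azure CNI model, where each node reserves one IP for itself plus one per potential pod, and a max pods setting of 30 per node; both function names are hypothetical, and you should substitute the values from your own cluster.

```python
def surge_ip_demand(surge_nodes: int, max_pods_per_node: int = 30) -> int:
    """IPs that the surge nodes would need (hypothetical helper).

    Assumption: one IP per node plus one per potential pod on that node.
    """
    return surge_nodes * (1 + max_pods_per_node)


def subnet_can_absorb_surge(free_ips: int, surge_nodes: int,
                            max_pods_per_node: int = 30) -> bool:
    """True if the subnet has enough free IPs for the surge nodes."""
    return free_ips >= surge_ip_demand(surge_nodes, max_pods_per_node)


# 10 surge nodes (as in the 5-pool example) with 30 pods per node.
print(surge_ip_demand(10))                                    # 310
print(subnet_can_absorb_surge(free_ips=200, surge_nodes=10))  # False
```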
The default max surge value is one, which minimizes workload disruption by creating an extra node before the cordon and drain of existing applications to replace an older-versioned node. The max surge value can be customized per node pool to trade off upgrade speed against upgrade disruption. Increasing the max surge value makes the upgrade complete faster, but a large value may cause disruptions during the upgrade. For example, a max surge value of 100% provides the fastest possible upgrade (doubling the node count) but also causes all nodes in the node pool to be drained simultaneously. For production node pools, a max surge setting of 33% is recommended.
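To make the speed versus disruption trade-off concrete, the sketch below uses a simplified model in which up to max surge nodes are replaced at the same time in each round. The `upgrade_rounds` function and the 9-node pool are illustrative assumptions, not how AKS reports progress.

```python
def upgrade_rounds(node_count: int, surge_nodes: int) -> int:
    """Approximate number of drain/replace rounds for one node pool.

    Simplified model (an assumption): up to surge_nodes nodes are replaced per round.
    """
    if surge_nodes < 1:
        raise ValueError("surge_nodes must be at least 1")
    return -(-node_count // surge_nodes)  # ceiling division


# A hypothetical 9-node pool: 33% resolves to 3 surge nodes, 100% to 9.
for label, surge in [("1 (default)", 1), ("33%", 3), ("100%", 9)]:
    rounds = upgrade_rounds(9, surge)
    print(f"max surge {label}: {surge} node(s) drained at once, about {rounds} round(s)")
```

Under this model the default takes about 9 rounds with one node drained at a time, 33% takes about 3 rounds, and 100% finishes in a single round with every node drained at once.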
How to solve your current issue?
- As shared in the answer provided by Sam, please try the steps to get the cluster out of the failed state and let me know if it is still failing. I can troubleshoot further and engage the right teams as needed.
- To raise the limit or quota for your subscription, go to the Azure portal and file a Service and subscription limits (quotas) support ticket. In this case, you have to submit a support ticket to increase the quota for compute cores.
Once that is completed, you can try the upgrade again.
If you have any questions, please let me know in the comments.