Upgrade options and recommendations for Azure Kubernetes Service (AKS) clusters

2025-07-12

This article gives you a technical foundation for AKS cluster upgrades, covering upgrade options and common scenarios. For in-depth guidance tailored to your needs, use the scenario-based navigation paths at the end of this article.

📖 What This Article Covers

This technical reference provides comprehensive AKS upgrade fundamentals:

Manual vs. automated upgrade options and when to use each
Common upgrade scenarios with specific recommendations
Optimization techniques for performance and minimal disruption
Troubleshooting guidance for capacity, drain failures, and timing issues
Validation processes and pre-upgrade checks

Best for: Understanding upgrade mechanics, troubleshooting issues, optimizing upgrade settings, technical implementation details.

Related guides: Production strategies • Stateful workloads • Scenario hub

New to AKS upgrades? Start with our Upgrade Scenarios Hub for guided, scenario-based assistance.

Your Situation	Recommended Path
Production cluster needing upgrade	Production Upgrade Strategies
Database/stateful workloads	Stateful Workload Patterns
First-time upgrade or basic cluster	Basic AKS cluster upgrade
Multiple environments or fleet	Upgrade Scenarios Hub
Node pools or Windows nodes	Node pool upgrades
Specific node pool only	Single node pool upgrade

Upgrade options

Perform manual upgrades

Manual upgrades let you control when your cluster upgrades to a new Kubernetes version. They're useful for testing a release or for targeting a specific version.

Configure automatic upgrades

Automatic upgrades keep your cluster on a supported, up-to-date version. Use them when you want a set-and-forget approach.

Special considerations for node pools spanning multiple availability zones

AKS uses best-effort zone balancing in node groups. During an upgrade surge, the zones for surge nodes in Virtual Machine Scale Sets are unknown ahead of time, which can temporarily leave the zones unbalanced. AKS deletes surge nodes after the upgrade and restores the original zone balance. To keep zones balanced during the upgrade, set surge to a multiple of three nodes. Persistent volume claims (PVCs) that use Azure locally redundant storage (LRS) disks are zone-bound, so a pod that's rescheduled to a surge node in a different zone can't mount its volume there and might experience downtime. Use a Pod Disruption Budget to maintain high availability during drains.

Optimize upgrades to improve performance and minimize disruptions

Combine Planned Maintenance Window, Max Surge, Pod Disruption Budget, node drain timeout, and node soak time to increase the likelihood of successful, low-disruption upgrades.

Planned Maintenance Window: Schedule auto-upgrades during low-traffic periods (a window of at least four hours is recommended)
Max Surge: Higher values speed up upgrades but may disrupt workloads; 33% is recommended for production
Max Unavailable: Upgrades by draining existing nodes instead of adding surge nodes; use when capacity is limited
Pod Disruption Budget: Limits how many pods can be down at once during upgrades; validate the setting for your service
Node drain timeout: How long to wait for pod eviction on each node (default 30 minutes)
Node soak time: Wait time between node upgrades, used to stagger them and minimize downtime (default 0 minutes)

Upgrade settings	Surge nodes available	Expected behavior
`maxSurge=5`, `maxUnavailable=0`	5 surge nodes	5 nodes surged for upgrade
`maxSurge=5`, `maxUnavailable=0`	0-4 surge nodes	Upgrade fails due to insufficient surge nodes
`maxSurge=0`, `maxUnavailable=5`	N/A	5 existing nodes drained for upgrade
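
When maxSurge is given as a percentage, it helps to know how many surge nodes that implies before checking quota and capacity. The following is a minimal sketch, assuming the percentage is rounded up to the next whole node; the function name `surge_nodes` is hypothetical and not part of any Azure tooling.

```shell
# surge_nodes NODE_COUNT PERCENT
# Estimates the number of surge nodes for a percentage-based maxSurge.
# Assumption: fractional results are rounded up to the next whole node.
surge_nodes() {
  local nodes="$1" percent="$2"
  # Integer ceiling of nodes * percent / 100
  echo $(( (nodes * percent + 99) / 100 ))
}

surge_nodes 10 33   # prints 4
```

For a 10-node pool at 33%, the estimate is 4 surge nodes, so quota and subnet space would need room for at least that many extra nodes.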

Note

Before upgrading, check for API breaking changes and review the AKS release notes to avoid disruptions.

Validations used in the upgrade process

AKS performs pre-upgrade validations to ensure cluster health:

API breaking changes: Detects deprecated APIs.
Kubernetes upgrade version: Ensures valid upgrade path.
PDB configuration: Checks for misconfigured PDBs (e.g., maxUnavailable=0).
Quota: Confirms enough quota for surge nodes.
Subnet: Verifies sufficient IP addresses.
Certificates/Service Principals: Detects expired credentials.

These checks help minimize upgrade failures and provide early visibility into issues.

Common upgrade scenarios and recommendations

Scenario 1: Capacity constraints

If your cluster is limited by SKU or regional capacity, upgrades might fail when surge nodes can't be provisioned. This is common with specialized SKUs (like GPU nodes) or in regions with limited resources. Errors such as SKUNotAvailable, AllocationFailed, or OverconstrainedAllocationRequest might occur if maxSurge is set too high for available capacity.

Recommendations to prevent or resolve

Use maxUnavailable to upgrade using existing nodes instead of surging new ones.
Lower maxSurge to reduce extra capacity needs.
For security-only updates, use security patch reimages that don't require surge nodes.

Scenario 2: Node drain failures and PDBs

Upgrades require draining nodes (evicting pods). Drains can fail if:

Pods are slow to terminate (long shutdown hooks or persistent connections).
Strict Pod Disruption Budgets (PDBs) block pod evictions.

Example error message:

```
Code: UpgradeFailed
Message: Drain node ... failed when evicting pod ... failed with Too Many Requests error. This is often caused by a restrictive Pod Disruption Budget (PDB) policy. See https://aka.ms/aks/debugdrainfailures. Original error: Cannot evict pod as it would violate the pod's disruption budget. PDB debug info: ... blocked by pdb ... with 0 unready pods.
```

Recommendations to prevent or resolve

Set maxUnavailable in PDBs to allow at least one pod to be evicted.
Increase pod replicas so the disruption budget can tolerate evictions.
Use undrainableNodeBehavior to allow upgrades to proceed even if some nodes can't be drained:
- Schedule (Default): Node and surge replacement may be deleted, reducing capacity.
- Cordon (Recommended): Node is cordoned and labeled as kubernetes.azure.com/upgrade-status=Quarantined.
  - Example command:
```
az aks nodepool update \
  --resource-group <resource-group-name> \
  --cluster-name <cluster-name> \
  --name <node-pool-name> \
  --undrainable-node-behavior Cordon
```
    The following example output shows the undrainable node behavior updated:
```
"upgradeSettings": {
  "drainTimeoutInMinutes": null,
  "maxSurge": "1",
  "nodeSoakDurationInMinutes": null,
  "undrainableNodeBehavior": "Cordon"
}
```
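
To illustrate the PDB recommendation above (allow at least one pod to be evicted), the following sketch writes a PodDisruptionBudget manifest with `maxUnavailable: 1`. The names `my-app`, `my-app-pdb`, and the file name are hypothetical placeholders; adjust the selector to match your own workload, and make sure the workload runs more than one replica.

```shell
# Write a PDB manifest that permits one pod to be evicted at a time.
# All names below (my-app, my-app-pdb, my-app-pdb.yaml) are placeholders.
cat > my-app-pdb.yaml <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: my-app
EOF

# Apply it to the cluster once you have reviewed the selector:
# kubectl apply -f my-app-pdb.yaml
```

A budget of `maxUnavailable: 0` (or `minAvailable` equal to the replica count) blocks every eviction, which is the pattern that produces the drain error shown earlier.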

Max Blocked Nodes Allowed (Preview)

[Preview] The Max Blocked Nodes Allowed feature lets you specify how many nodes that fail to drain (blocked nodes) can be tolerated during upgrades or similar operations. This feature only works if the undrainable node behavior property is set; otherwise, the command will return an error.

Note

If you do not explicitly set Max Blocked Nodes Allowed, it defaults to the value of Max Surge. If Max Surge is not set, the default is typically 10%, so Max Blocked Nodes Allowed also defaults to 10%.

Prerequisites

Azure CLI aks-preview extension version 18.0.0b9 or later is required to use this feature.

Example command:

```
az aks nodepool update \
  --resource-group <resource-group-name> \
  --cluster-name <cluster-name> \
  --name <node-pool-name> \
  --max-surge 1 \
  --undrainable-node-behavior Cordon \
  --max-blocked-nodes 2 \
  --drain-timeout 5
```

Extend drain timeout if workloads need more time (default is 30 minutes).
Test PDBs in staging, monitor upgrade events, and use blue-green deployments for critical workloads.

Verifying undrainable nodes

Blocked nodes are cordoned (marked unschedulable for new pods) and labeled "kubernetes.azure.com/upgrade-status: Quarantined".
When a node fails to drain during an upgrade, verify the label on the blocked nodes:
```
kubectl get nodes --show-labels=true
```
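
On clusters with many nodes, the full label listing is hard to scan. The following is a small sketch of a filter that prints only the names of quarantined nodes; the function name `quarantined_nodes` is hypothetical, and it assumes the standard `kubectl get nodes --show-labels --no-headers` column layout, with the node name in the first column.

```shell
# quarantined_nodes: reads `kubectl get nodes --show-labels --no-headers`
# output on stdin and prints the name of each node that carries the
# kubernetes.azure.com/upgrade-status=Quarantined label.
quarantined_nodes() {
  awk '/kubernetes\.azure\.com\/upgrade-status=Quarantined/ { print $1 }'
}

# Usage against a live cluster:
# kubectl get nodes --show-labels --no-headers | quarantined_nodes
```

A label selector gives the same result directly: `kubectl get nodes -l kubernetes.azure.com/upgrade-status=Quarantined`.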

Resolving undrainable nodes

Remove the responsible PDB:
```
kubectl delete pdb <pdb-name>
```
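Remove the kubernetes.azure.com/upgrade-status: Quarantined label (the trailing hyphen after the label key tells kubectl to remove the label):
```
kubectl label nodes <node-name> kubernetes.azure.com/upgrade-status-
```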

Optionally, delete the blocked node:

```
az aks nodepool delete-machines --cluster-name <cluster-name> --machine-names <machine-name> --name <node-pool-name> --resource-group <resource-group-name>
```

After you complete this step, reconcile the cluster status by performing any update operation without the optional fields, such as the az aks update command shown below. Alternatively, scale the node pool to the same number of nodes as the count of upgraded nodes, which returns the node pool to its intended original size. AKS prioritizes removal of the blocked nodes during this scale operation, and the command also restores the cluster provisioning status to Succeeded. In the following example, 2 is the total number of upgraded nodes.
```
# Update the cluster to restore the provisioning status
az aks update --resource-group <resource-group-name> --name <cluster-name>

# Scale the node pool to restore the original size
az aks nodepool scale --resource-group <resource-group-name> --cluster-name <cluster-name> --name <node-pool-name> --node-count 2
```

Scenario 3: Slow upgrades

Upgrades can be delayed by conservative settings or node-level issues, impacting your ability to stay current with patches and improvements.

Common causes of slow upgrades include:

Low maxSurge or maxUnavailable values (limits parallelism).
High soak times (long waits between node upgrades).
Drain failures (see Scenario 2: Node drain failures and PDBs).

Recommendations to prevent or resolve

For production: maxSurge=33%, maxUnavailable=1.
For dev/test: maxSurge=50%, maxUnavailable=2.
Use OS Security Patch for fast, targeted patching (avoids full node reimaging).
Enable undrainableNodeBehavior to avoid upgrade blockers.

Scenario 4: IP exhaustion

Surge nodes require additional IPs. If the subnet is near capacity, node provisioning can fail (e.g., Error: SubnetIsFull). This is common with Azure CNI, high maxPods, or large node counts.

Recommendations to prevent or resolve

Ensure your subnet has enough IPs for all nodes, surge nodes, and pods:
- Formula: Total IPs = (Number of nodes + maxSurge) * (1 + maxPods)
Reclaim unused IPs or expand the subnet (e.g., from /24 to /22).

Lower maxSurge if subnet expansion isn't possible:

```
az aks nodepool update \
  --resource-group <resource-group-name> \
  --cluster-name <cluster-name> \
  --name <node-pool-name> \
  --max-surge 10%
```

Monitor IP usage with Azure Monitor or custom alerts.
Reduce maxPods per node, clean up orphaned load balancer IPs, and plan subnet sizing for high-scale clusters.
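
The subnet sizing formula above can be checked with a few lines of shell arithmetic. This is a rough sketch, not an official sizing tool: the function names are hypothetical, maxSurge is expressed as a node count (convert a percentage to nodes first), and the usable-address calculation assumes Azure reserves five addresses in each subnet.

```shell
# required_ips NODES MAX_SURGE_NODES MAX_PODS
# Total IPs = (Number of nodes + maxSurge) * (1 + maxPods)
required_ips() {
  local nodes="$1" max_surge="$2" max_pods="$3"
  echo $(( (nodes + max_surge) * (1 + max_pods) ))
}

# usable_subnet_ips PREFIX_LENGTH
# Usable IPv4 addresses in a subnet, assuming 5 addresses are reserved.
usable_subnet_ips() {
  local prefix="$1"
  echo $(( (1 << (32 - prefix)) - 5 ))
}

required_ips 10 3 30    # prints 403
usable_subnet_ips 24    # prints 251
usable_subnet_ips 22    # prints 1019
```

In this example, 10 nodes with 3 surge nodes and maxPods=30 need 403 addresses, which doesn't fit in a /24 but does fit in a /22.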

❓ Frequently Asked Questions

Can I use open-source tools for validation?

Yes! Many open-source tools integrate well with AKS upgrade processes:

kube-no-trouble (kubent) - Scans for deprecated APIs before upgrades
Trivy - Security scanning for container images and Kubernetes configurations
Sonobuoy - Kubernetes conformance testing and cluster validation
kube-bench - Security benchmark checks against CIS standards
Polaris - Validation of Kubernetes best practices
kubectl-neat - Clean up Kubernetes manifests for validation

How do I validate API compatibility before upgrading?

Run deprecation checks using tools like kubent:

```
# Install and run API deprecation scanner
kubectl apply -f https://github.com/doitintl/kube-no-trouble/releases/latest/download/knt-full.yaml

# Check for deprecated APIs in your cluster
kubectl run knt --image=doitintl/knt:latest --rm -it --restart=Never -- \
  -c /kubeconfig -o json > api-deprecation-report.json

# Review findings
cat api-deprecation-report.json | jq '.[] | select(.deprecated==true)'
```

What makes AKS upgrades different from other Kubernetes platforms?

AKS provides several unique advantages:

Native Azure integration with Traffic Manager, Load Balancer, and networking
Azure Fleet Manager for coordinated multi-cluster upgrades
Automatic node image patching without manual node management
Built-in validation for quota, networking, and credentials
Azure support for upgrade-related issues

Now Choose Your Upgrade Path

This article provided the technical foundation - now select your scenario-based path:

🚀 Ready to Execute?

If you have...	Then go to...
Production environment	Production Upgrade Strategies - Battle-tested patterns for zero-downtime upgrades
Databases or stateful apps	Stateful Workload Patterns - Safe upgrade patterns for data persistence
Multiple environments	Upgrade Scenarios Hub - Decision tree for complex setups
Basic cluster	Upgrade an AKS cluster - Step-by-step cluster upgrade

🔍 Still Deciding?

Visit the Upgrade Scenarios Hub for a guided decision tree that considers your:

Downtime tolerance
Environment complexity
Risk profile
Timeline constraints

Next steps

Review AKS patch and upgrade guidance for best practices and planning tips before starting any upgrade.
Always check for API breaking changes and validate your workloads' compatibility with the target Kubernetes version.
Test upgrade settings (such as maxSurge, maxUnavailable, and PDBs) in a staging environment to minimize production risk.
Monitor upgrade events and cluster health throughout the process.
