Summary
This article explains how to identify and resolve UpgradeFailed errors that occur when you try to upgrade an Azure Kubernetes Service (AKS) cluster and pod eviction is blocked by Pod Disruption Budgets (PDBs).
Prerequisites
This article requires Azure CLI version 2.67.0 or a later version. To find the version number, run az --version. If you need to install or upgrade Azure CLI, see How to install the Azure CLI.
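The version check can also be scripted. The following is a minimal sketch that assumes a `sort` implementation with version-sort support (`-V`, as in GNU coreutils). The helper name `version_ge` is illustrative and isn't part of Azure CLI.

```shell
#!/usr/bin/env bash
# version_ge A B: succeeds when version A is greater than or equal to version B.
# Assumes a sort implementation that supports -V (version sort).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Usage against a live installation (not run here):
#   installed="$(az version --query '"azure-cli"' --output tsv)"
#   version_ge "$installed" "2.67.0" || echo "Upgrade Azure CLI to 2.67.0 or later"
```

A version sort is used instead of a plain string comparison so that, for example, 2.9.0 is treated as older than 2.67.0.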
For more detailed information about the upgrade process, see the "Upgrade an AKS cluster" section in Upgrade an Azure Kubernetes Service (AKS) cluster.
Symptoms
An AKS cluster upgrade operation fails with one of the following error messages:
- (UpgradeFailed) Drain node aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx failed when evicting pod <pod-name> failed with Too Many Requests error. This error is often caused by a restrictive Pod Disruption Budget (PDB) policy. See https://aka.ms/aks/debugdrainfailures. Original error: Cannot evict pod as it would violate the pod's disruption budget.. PDB debug info: <namespace>/<pod-name> blocked by pdb <pdb-name> with 0 unready pods.
- Code: UpgradeFailed
  Message: Drain node aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx failed when evicting pod <pod-name> failed with Too Many Requests error. This error is often caused by a restrictive Pod Disruption Budget (PDB) policy. See https://aka.ms/aks/debugdrainfailures. Original error: Cannot evict pod as it would violate the pod's disruption budget.. PDB debug info: <namespace>/<pod-name> blocked by pdb <pdb-name> with 0 unready pods.
Cause
This error occurs if a pod is protected by the Pod Disruption Budget (PDB) policy. In this situation, the pod resists being drained. After several attempts, the upgrade operation fails, and the cluster or node pool falls into a Failed state.
Check the ALLOWED DISRUPTIONS value in the PDB configuration. The value should be 1 or greater. For more information, see Plan for availability using pod disruption budgets. For example, you can check the workload and its PDB as follows. In this example, the ALLOWED DISRUPTIONS value is 0, so the pods aren't evicted and node drain fails during the upgrade process:
$ kubectl get deployments.apps nginx
NAME    READY   UP-TO-DATE   AVAILABLE   AGE
nginx   2/2     2            2           62s
$ kubectl get pod
NAME                     READY   STATUS    RESTARTS   AGE
nginx-7854ff8877-gbr4m   1/1     Running   0          68s
nginx-7854ff8877-gnltd   1/1     Running   0          68s
$ kubectl get pdb
NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
nginx-pdb   2               N/A               0                     24s
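On a cluster that has many PDBs, it can help to list only the ones that currently allow no disruptions. The following sketch filters three-column rows (namespace, name, allowed disruptions) that `kubectl` can produce by using custom columns. The function name `filter_blocking_pdbs` is illustrative, and the column layout is an assumption of this example.

```shell
#!/usr/bin/env bash
# filter_blocking_pdbs: reads "NAMESPACE NAME ALLOWED" rows on stdin and prints
# "namespace/name" for every PDB whose allowed disruptions value is 0.
filter_blocking_pdbs() {
  awk '$3 == 0 { print $1 "/" $2 }'
}

# Usage against a live cluster (not run here):
#   kubectl get pdb --all-namespaces --no-headers \
#     -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed' \
#     | filter_blocking_pdbs
```

Each PDB that the filter prints is a candidate for the solutions described later in this article.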
You can also check for related entries in Kubernetes events by running the command kubectl get events | grep -i drain. Output similar to the following shows the message "Eviction blocked by Too Many Requests (usually a pdb)":
$ kubectl get events | grep -i drain
LAST SEEN TYPE REASON OBJECT MESSAGE
(...)
32m Normal Drain node/aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx Draining node: aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx
2m57s Warning Drain node/aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx Eviction blocked by Too Many Requests (usually a pdb): <pod-name>
12m Warning Drain node/aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx Eviction blocked by Too Many Requests (usually a pdb): <pod-name>
32m Warning Drain node/aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx Eviction blocked by Too Many Requests (usually a pdb): <pod-name>
32m Warning Drain node/aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx Eviction blocked by Too Many Requests (usually a pdb): <pod-name>
31m Warning Drain node/aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx Eviction blocked by Too Many Requests (usually a pdb): <pod-name>
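Because the same pod is typically reported many times, a short filter can reduce the event list to the distinct pods that block the drain. This sketch assumes that the pod name is the last field of the event message, as in the preceding output. The function name `blocked_pods` is illustrative.

```shell
#!/usr/bin/env bash
# blocked_pods: reads `kubectl get events` output on stdin and prints each pod
# named in an "Eviction blocked by Too Many Requests" message exactly once.
blocked_pods() {
  awk '/Eviction blocked by Too Many Requests/ { print $NF }' | sort -u
}

# Usage against a live cluster (not run here):
#   kubectl get events | blocked_pods
```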
To resolve this issue, use one of the following solutions.
Solution 1: Enable pods to drain
Adjust the PDB to enable pod draining. Generally, the allowed disruption is controlled by the Min Available / Max unavailable or Running pods / Replicas parameter. Modify the Min Available / Max unavailable parameter at the PDB level, or increase the number of Running pods / Replicas, to push the Allowed Disruption value to 1 or greater.
Try again to upgrade the AKS cluster to the same version that you tried to upgrade to previously. This process triggers a reconciliation.
$ az aks upgrade --name <aksName> --resource-group <resourceGroupName>
Are you sure you want to perform this operation? (y/N): y
Cluster currently in failed state. Proceeding with upgrade to existing version 1.28.3 to attempt resolution of failed cluster state.
Since control-plane-only argument is not specified, this will upgrade the control plane AND all nodepools to version . Continue? (y/N): y
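To see why either adjustment in Solution 1 raises the Allowed Disruption value, the following sketch models the calculation for a PDB that uses an integer Min Available value: the allowed disruptions equal the number of healthy pods minus Min Available, and never drop below zero. This is a simplified model for illustration. Percentage values and Max unavailable are out of scope, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# allowed_disruptions HEALTHY MIN_AVAILABLE
# Simplified model: healthy pods minus minAvailable, floored at zero.
allowed_disruptions() {
  allowed=$(( $1 - $2 ))
  if [ "$allowed" -lt 0 ]; then
    allowed=0
  fi
  echo "$allowed"
}

allowed_disruptions 2 2   # the nginx example: 2 healthy pods, minAvailable 2
allowed_disruptions 3 2   # after scaling the workload to 3 replicas
allowed_disruptions 2 1   # after lowering minAvailable to 1

# Commands that apply either change on a live cluster (not run here; the
# values 3 and 1 are illustrative, and the patch assumes that the PDB is
# defined with minAvailable):
#   kubectl scale --replicas=3 deployment.apps <name> -n <namespace>
#   kubectl patch pdb <pdb-name> -n <pdb-namespace> --type merge \
#     -p '{"spec":{"minAvailable":1}}'
```

The three calls print 0, 1, and 1, which matches the ALLOWED DISRUPTIONS value of 0 shown for nginx-pdb before any change.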
Solution 2: Back up, delete, and redeploy the PDB
Note
Use this solution if editing the PDB resource isn't a viable option.
Back up the PDBs by running the following command:
kubectl get pdb <pdb-name> -n <pdb-namespace> -o yaml > pdb-name-backup.yaml
Delete the PDB by running the following command:
kubectl delete pdb <pdb-name> -n <pdb-namespace>
After the new upgrade attempt finishes, redeploy the PDB by applying the backup file using the following command:
kubectl apply -f pdb-name-backup.yaml
Try again to upgrade the AKS cluster to the same version that you tried to upgrade to previously. This process triggers a reconciliation.
$ az aks upgrade --name <aksName> --resource-group <resourceGroupName>
Are you sure you want to perform this operation? (y/N): y
Cluster currently in failed state. Proceeding with upgrade to existing version 1.28.3 to attempt resolution of failed cluster state.
Since control-plane-only argument is not specified, this will upgrade the control plane AND all nodepools to version . Continue? (y/N): y
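The order of operations in Solution 2 matters: back up, delete, upgrade, and then restore. The following sketch chains those commands in that order. The names `run` and `pdb_cycle` and the backup file name are assumptions of this example. By default, it only prints the commands (dry run) so that you can review them before you run anything against a cluster.

```shell
#!/usr/bin/env bash
# Sketch of the Solution 2 sequence. With DRY_RUN=1 (the default), commands are
# printed instead of executed. Set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"

# run: print the command in dry-run mode; otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# pdb_cycle PDB_NAME PDB_NAMESPACE CLUSTER_NAME RESOURCE_GROUP
pdb_cycle() {
  pdb="$1"; ns="$2"; cluster="$3"; rg="$4"
  backup="${pdb}-backup.yaml"

  # 1. Back up the PDB (the redirect is only described in dry-run mode).
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ kubectl get pdb $pdb -n $ns -o yaml > $backup"
  else
    kubectl get pdb "$pdb" -n "$ns" -o yaml > "$backup" || return 1
  fi
  # 2. Delete the PDB so that it no longer blocks eviction.
  run kubectl delete pdb "$pdb" -n "$ns" || return 1
  # 3. Retry the upgrade; --yes answers the confirmation prompts.
  run az aks upgrade --name "$cluster" --resource-group "$rg" --yes || return 1
  # 4. Restore the PDB from the backup. The backup contains server-populated
  #    metadata; if the apply is rejected, you might need to remove fields such
  #    as metadata.resourceVersion and metadata.uid from the file first.
  run kubectl apply -f "$backup"
}

# Example (dry run): pdb_cycle <pdb-name> <pdb-namespace> <aksName> <resourceGroupName>
```

Each step stops the sequence if the previous command fails, so the PDB isn't deleted unless the backup succeeded.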
Solution 3: Delete the pods that you can't drain or scale the workload down to zero
- Delete the pods that you can't drain.
Note
If a Deployment or StatefulSet creates the pods, its controller recreates them after you delete them (for a Deployment, through a ReplicaSet). If that's the case, you might need to delete the Deployment or StatefulSet, or scale its replicas to zero. Before you make this change, back up the resource by running: kubectl get <deployment.apps -or- statefulset.apps> <name> -n <namespace> -o yaml > backup.yaml.
To scale down the workload, use kubectl scale --replicas=0 <deployment.apps -or- statefulset.apps> <name> -n <namespace>.
- Try again to upgrade the AKS cluster to the same version that you tried to upgrade to previously. This process triggers a reconciliation.
$ az aks upgrade --name <aksName> --resource-group <resourceGroupName>
Are you sure you want to perform this operation? (y/N): y
Cluster currently in failed state. Proceeding with upgrade to existing version 1.28.3 to attempt resolution of failed cluster state.
Since control-plane-only argument is not specified, this will upgrade the control plane AND all nodepools to version . Continue? (y/N): y
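If you scale a workload to zero for Solution 3, it helps to record the original replica count so that you can restore it after the upgrade. The following sketch prints the commands in order without running them. The function name `print_scale_plan` is hypothetical, and restoring the replica count afterward is an assumption of this example about how the backup would be used.

```shell
#!/usr/bin/env bash
# print_scale_plan KIND NAME NAMESPACE REPLICAS
# Prints, without running them, the commands to back up a workload, scale it to
# zero, retry the upgrade, and restore the recorded replica count.
print_scale_plan() {
  kind="$1"; name="$2"; ns="$3"; replicas="$4"
  cat <<EOF
kubectl get $kind $name -n $ns -o yaml > backup.yaml
kubectl scale --replicas=0 $kind $name -n $ns
az aks upgrade --name <aksName> --resource-group <resourceGroupName>
kubectl scale --replicas=$replicas $kind $name -n $ns
EOF
}

# The current replica count can be read from a live cluster (not run here):
#   kubectl get deployment.apps nginx -n default -o jsonpath='{.spec.replicas}'
print_scale_plan deployment.apps nginx default 2
```

Printing the plan first lets you review the workload type, namespace, and replica count before you take the workload offline.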