Troubleshoot UpgradeFailed errors due to eviction failures caused by PDBs

Summary

This article explains how to identify and resolve UpgradeFailed errors due to eviction failures caused by Pod Disruption Budgets (PDBs) that occur when you try to upgrade an Azure Kubernetes Service (AKS) cluster.

Prerequisites

This article requires Azure CLI version 2.67.0 or a later version. To find the version number, run az --version. If you need to install or upgrade Azure CLI, see How to install the Azure CLI.

For more detailed information about the upgrade process, see the "Upgrade an AKS cluster" section in Upgrade an Azure Kubernetes Service (AKS) cluster.

Symptoms

An AKS cluster upgrade operation fails with one of the following error messages:

(UpgradeFailed) Drain node aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx failed when evicting pod <pod-name> failed with Too Many Requests error. This error is often caused by a restrictive Pod Disruption Budget (PDB) policy. See https://aka.ms/aks/debugdrainfailures. Original error: Cannot evict pod as it would violate the pod's disruption budget.. PDB debug info: <namespace>/<pod-name> blocked by pdb <pdb-name> with 0 unready pods.
Code: UpgradeFailed
Message: Drain node aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx failed when evicting pod <pod-name> failed with Too Many Requests error. This error is often caused by a restrictive Pod Disruption Budget (PDB) policy. See https://aka.ms/aks/debugdrainfailures. Original error: Cannot evict pod as it would violate the pod's disruption budget.. PDB debug info: <namespace>/<pod-name> blocked by pdb <pdb-name> with 0 unready pods.

Cause

This error occurs if a pod is protected by a Pod Disruption Budget (PDB) policy that doesn't currently allow any disruption. In this situation, the pod can't be evicted when its node is drained. After several attempts, the upgrade operation fails, and the cluster or node pool falls into a Failed state.

Check the ALLOWED DISRUPTIONS value in the PDB configuration. The value should be 1 or greater. For more information, see Plan for availability using pod disruption budgets. For example, you can check the workload and its PDB as follows. In this example, the ALLOWED DISRUPTIONS column shows 0, which means that no disruption is allowed. If the ALLOWED DISRUPTIONS value is 0, the pods can't be evicted, and the node drain fails during the upgrade process:

$ kubectl get deployments.apps nginx
NAME    READY   UP-TO-DATE   AVAILABLE   AGE
nginx   2/2     2            2           62s

$ kubectl get pod
NAME                     READY   STATUS    RESTARTS   AGE
nginx-7854ff8877-gbr4m   1/1     Running   0          68s
nginx-7854ff8877-gnltd   1/1     Running   0          68s

$ kubectl get pdb
NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
nginx-pdb   2               N/A               0                     24s
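
When a cluster has many PDBs, scanning the table by eye is error-prone. The following minimal sketch filters for PDBs that currently allow no disruptions. The function name find_blocking_pdbs is hypothetical, the "default" namespace and the other-pdb row are made-up sample data, and the commented kubectl invocation assumes that the .status.disruptionsAllowed field is populated on your cluster.

```shell
# Hypothetical helper: reads "namespace name allowed" rows on stdin and
# prints namespace/name for every PDB whose allowed disruptions value is 0.
find_blocking_pdbs() {
  awk '$3 == 0 { print $1 "/" $2 }'
}

# Against a live cluster (requires kubectl access), the rows could come from:
#   kubectl get pdb --all-namespaces --no-headers \
#     -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed' \
#     | find_blocking_pdbs

# Sample rows: the nginx-pdb from the example above (namespace assumed to be
# "default") plus a made-up PDB that allows one disruption.
printf '%s\n' 'default nginx-pdb 0' 'default other-pdb 1' | find_blocking_pdbs
# prints: default/nginx-pdb
```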

You can also check for related entries in Kubernetes events by running the kubectl get events | grep -i drain command. Output that resembles the following example contains the message "Eviction blocked by Too Many Requests (usually a pdb)":

$ kubectl get events | grep -i drain
LAST SEEN   TYPE      REASON                    OBJECT                                   MESSAGE
(...)
32m         Normal    Drain                     node/aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx   Draining node: aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx
2m57s       Warning   Drain                     node/aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx   Eviction blocked by Too Many Requests (usually a pdb): <pod-name>
12m         Warning   Drain                     node/aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx   Eviction blocked by Too Many Requests (usually a pdb): <pod-name>
32m         Warning   Drain                     node/aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx   Eviction blocked by Too Many Requests (usually a pdb): <pod-name>
32m         Warning   Drain                     node/aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx   Eviction blocked by Too Many Requests (usually a pdb): <pod-name>
31m         Warning   Drain                     node/aks-<nodepool-name>-xxxxxxxx-vmssxxxxxx   Eviction blocked by Too Many Requests (usually a pdb): <pod-name>
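
Because the same warning repeats on every eviction retry, it can help to reduce the events to the distinct pods that are blocked. The following sketch assumes that the pod name is the last field of the event message, as in the output above. The function name blocked_pods is hypothetical, and the sample node and pod names are made up for illustration.

```shell
# Hypothetical helper: reads `kubectl get events` output on stdin and prints
# each pod named in an "Eviction blocked" drain warning exactly once.
blocked_pods() {
  awk '/Eviction blocked by Too Many Requests/ { print $NF }' | sort -u
}

# Against a live cluster: kubectl get events | blocked_pods

# Made-up sample events in the same shape as the output above.
blocked_pods <<'EOF'
32m    Normal    Drain   node/aks-nodepool1-00000000-vmss000000   Draining node: aks-nodepool1-00000000-vmss000000
2m57s  Warning   Drain   node/aks-nodepool1-00000000-vmss000000   Eviction blocked by Too Many Requests (usually a pdb): nginx-7854ff8877-gbr4m
12m    Warning   Drain   node/aks-nodepool1-00000000-vmss000000   Eviction blocked by Too Many Requests (usually a pdb): nginx-7854ff8877-gnltd
32m    Warning   Drain   node/aks-nodepool1-00000000-vmss000000   Eviction blocked by Too Many Requests (usually a pdb): nginx-7854ff8877-gbr4m
EOF
```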

To resolve this issue, use one of the following solutions.

Solution 1: Enable pods to drain

Adjust the PDB to enable pod draining. Generally, the number of allowed disruptions is determined by the PDB's Min Available or Max Unavailable setting together with the number of running pods (replicas). Modify the Min Available or Max Unavailable setting at the PDB level, or increase the number of replicas, so that the ALLOWED DISRUPTIONS value becomes 1 or greater.
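
To see why either change works, the arithmetic can be sketched as follows. This is a simplified model, not the controller's implementation: it assumes a PDB that sets an integer Min Available value and treats the allowed disruptions as the number of healthy pods minus that value, never less than zero. The function name allowed_disruptions is hypothetical.

```shell
# Hypothetical helper: approximates ALLOWED DISRUPTIONS for a PDB that sets an
# integer minAvailable. Arguments: <healthy-pods> <min-available>.
allowed_disruptions() {
  healthy=$1
  min_available=$2
  allowed=$((healthy - min_available))
  # The value is never negative.
  if [ "$allowed" -lt 0 ]; then
    allowed=0
  fi
  echo "$allowed"
}

allowed_disruptions 2 2   # nginx example: 2 healthy pods, minAvailable 2 -> 0
allowed_disruptions 2 1   # after lowering minAvailable to 1             -> 1
allowed_disruptions 3 2   # after scaling the deployment to 3 replicas   -> 1
```

For the nginx example, one possible way to apply either change is kubectl patch pdb nginx-pdb --type merge -p '{"spec":{"minAvailable":1}}' or kubectl scale deployment nginx --replicas=3. Which option is appropriate depends on the availability that the workload needs.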

Try again to upgrade the AKS cluster to the same version that you tried to upgrade to previously. This process triggers a reconciliation.

$ az aks upgrade --name <aksName> --resource-group <resourceGroupName>
Are you sure you want to perform this operation? (y/N): y
Cluster currently in failed state. Proceeding with upgrade to existing version 1.28.3 to attempt resolution of failed cluster state.
Since control-plane-only argument is not specified, this will upgrade the control plane AND all nodepools to version . Continue? (y/N): y

Solution 2: Back up, delete, and redeploy the PDB

Note

Use this solution if editing the PDB resource isn't a viable option.

Back up the PDBs by running the following command:

kubectl get pdb <pdb-name> -n <pdb-namespace> -o yaml > pdb-name-backup.yaml

Delete the PDB by running the following command:

kubectl delete pdb <pdb-name> -n <pdb-namespace>

After the new upgrade attempt finishes, redeploy the PDB by applying the backup file:

kubectl apply -f pdb-name-backup.yaml

Try again to upgrade the AKS cluster to the same version that you tried to upgrade to previously. This process triggers a reconciliation.

$ az aks upgrade --name <aksName> --resource-group <resourceGroupName>
Are you sure you want to perform this operation? (y/N): y
Cluster currently in failed state. Proceeding with upgrade to existing version 1.28.3 to attempt resolution of failed cluster state.
Since control-plane-only argument is not specified, this will upgrade the control plane AND all nodepools to version . Continue? (y/N): y
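
The order of operations in this solution matters: back up the PDB, delete it, retry the upgrade, and only then reapply the backup. The following sketch chains those steps by using only the commands shown in this section. The function names, the DRY_RUN switch, the backup file naming, and the sample cluster and resource group names are all hypothetical. The choice to reapply the PDB even if the upgrade fails is an assumption, not something this article prescribes.

```shell
# Hypothetical wrapper: prints the command when DRY_RUN=1, otherwise runs it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Hypothetical helper.
# Arguments: <pdb-name> <pdb-namespace> <aksName> <resourceGroupName>
upgrade_without_pdb() {
  pdb=$1
  ns=$2
  cluster=$3
  rg=$4
  backup="${pdb}-backup.yaml"   # assumed file naming

  # 1. Back up the PDB. Stop if the backup fails so that the PDB is never
  #    deleted without a copy.
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "+ kubectl get pdb $pdb -n $ns -o yaml > $backup"
  else
    kubectl get pdb "$pdb" -n "$ns" -o yaml > "$backup" || return 1
  fi

  # 2. Delete the PDB so that evictions are no longer blocked.
  run kubectl delete pdb "$pdb" -n "$ns" || return 1

  # 3. Retry the upgrade (the CLI asks for confirmation interactively).
  run az aks upgrade --name "$cluster" --resource-group "$rg"
  upgrade_status=$?

  # 4. Reapply the PDB after the upgrade attempt finishes (assumption:
  #    restore it whether or not the upgrade succeeded).
  run kubectl apply -f "$backup" || return 1

  return "$upgrade_status"
}

# Dry run with made-up names: prints the four commands without running them.
DRY_RUN=1
upgrade_without_pdb nginx-pdb default myAKSCluster myResourceGroup
```

Setting DRY_RUN to 0 (or leaving it unset) runs the commands for real, so review the printed sequence first.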

Solution 3: Delete the pods that you can't drain or scale the workload down to zero

Delete the pods that you can't drain.