az aks upgrade to kubernetes version 1.18.14 is failing because of “Pod Disruption Budgets” (partially completed)

Colasanto, Francesca 26 Reputation points
2021-03-30T11:33:32.147+00:00

az aks upgrade --resource-group --name --kubernetes-version 1.18.14 (from 1.17.9)

is reporting the following error:

Deployment failed. Correlation ID: fa7565e0-a741-4ef2-accf-a76be59da209. Drain did not complete pods [nginx-ingress-ingress-nginx-controller-744847f7b8-kh7bc] on vm aks-agentpool-42415862-vmss000004. Check Pod Disruption Budgets

This is causing an inconsistent configuration (some nodes appear to be upgraded to 1.18.14 and some are not yet, e.g. aks-agentpool-42415862-vmss000004).

Any hints will be appreciated.

Thanks !

Azure Kubernetes Service (AKS)

Accepted answer
  1. shiva patpi 13,141 Reputation points Microsoft Employee
    2021-03-30T22:49:48.87+00:00

    Hello @Colasanto, Francesca,
    Thanks for your query!
    Based on your existing PodDisruptionBudget configuration, the AKS upgrade failure is expected. The node aks-agentpool-42415862-vmss000004 was not upgraded because the upgrade process was not able to move the pod nginx-ingress-ingress-nginx-controller-744847f7b8-kh7bc to another node (it failed to drain the node because of the pod's PDB).

    You are hitting the issue described in the links below:

    Take a look at a similar post

    Basics of PDB

    The effect of PDB

    How to configure PDB (best practices)

    (A detailed description is given in the articles above.)
    In short, if you look at your PDB, it says the minimum available must always be 1 (i.e. at least 1 pod must be available at all times). During the upgrade, AKS tries to drain each node, and as part of draining the node the pods are evicted and moved to another node. Since the PDB says the minimum available is 1, evicting the controller pod while no second replica is available would violate the PDB, so the eviction is refused and the drain cannot complete on that node. You can check the PDB's currently allowed disruptions as shown below.
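    For example, you can confirm this in the cluster (the namespace is an assumption here; add -n <your_namespace> if the ingress controller does not run in the default namespace):

    # Shows MIN AVAILABLE and ALLOWED DISRUPTIONS; 0 allowed disruptions means the node drain will be blocked
    kubectl get pdb nginx-ingress-ingress-nginx-controller

    # More detail, including which pods the PDB currently matches
    kubectl describe pdb nginx-ingress-ingress-nginx-controller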

    Basic rule while defining a PDB:
    Make sure the PDB still allows at least one disruption during the upgrade, i.e. minAvailable must be lower than the existing number of replicas (or maxUnavailable must be at least 1).

    Mitigation 1:
    Try deleting the PDB and then run the upgrade:
    kubectl delete pdb nginx-ingress-ingress-nginx-controller
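    A minimal sketch of that sequence, assuming you want the PDB back once the upgrade has finished (the resource group and cluster name are placeholders for the values in your original command):

    # Back up the current PDB so it can be restored later
    kubectl get pdb nginx-ingress-ingress-nginx-controller -o yaml > /tmp/nginx-pdb-backup.yaml

    # Delete the PDB so the drain can evict the ingress controller pod
    kubectl delete pdb nginx-ingress-ingress-nginx-controller

    # Re-run the upgrade on the remaining nodes
    az aks upgrade --resource-group <resource-group> --name <cluster-name> --kubernetes-version 1.18.14

    # Restore the PDB once all nodes are on 1.18.14
    kubectl apply -f /tmp/nginx-pdb-backup.yaml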

    Mitigation 2:
    Try increasing the number of replicas of the nginx-ingress-ingress-nginx-controller deployment in your deployment yaml file, so that one pod can be evicted while the PDB is still satisfied (see the sketch below).
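    For example (the deployment name is inferred from the pod name in the error message and may differ in your release; add -n <your_namespace> if needed):

    # With 2 replicas and minAvailable: 1, one pod can be evicted during the drain while the other keeps serving traffic
    kubectl scale deployment nginx-ingress-ingress-nginx-controller --replicas=2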

    Mitigation 3:
    Try changing the PDB itself so that it allows at least one disruption, for example by replacing minAvailable: 1 with maxUnavailable: 1 (see the sketch below and the sample PDB in the answer that follows).
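    One way to do that in place, as a sketch (this assumes the PDB currently uses minAvailable and that a JSON merge patch is acceptable; setting a field to null removes it in a merge patch):

    # Swap minAvailable: 1 for maxUnavailable: 1 so one pod may be evicted during the drain
    kubectl patch pdb nginx-ingress-ingress-nginx-controller --type=merge -p '{"spec":{"minAvailable":null,"maxUnavailable":1}}'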

    Hope the explanation above helps you understand and resolve the issue. If it helps, kindly upvote and accept the answer.

    2 people found this answer helpful.

1 additional answer

  1. COLASANTO, FRANCESCA 31 Reputation points
    2021-06-17T06:30:44.603+00:00

    PodDisruptionBudget yaml file sample:

    apiVersion: policy/v1beta1
    kind: PodDisruptionBudget
    metadata:
      name: nginx-pdb
      namespace: <your_namespace>
    spec:
      maxUnavailable: 1
      selector:
        matchLabels:
          app: nginx-frontend

    You can run the following in your kube context:

    $ kubectl get poddisruptionbudget -n <your_namespace>

    $ kubectl get poddisruptionbudget <poddisruptionbudgetname> -n <your_namespace> -o yaml > /tmp/poddisruptionbudget.yaml

    In /tmp/poddisruptionbudget.yaml set "maxUnavailable: 1", save, and exit.

    $ kubectl apply -f /tmp/poddisruptionbudget.yaml -n <your_namespace>
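    Before re-running the upgrade you can verify the change took effect (a quick sanity check; the ALLOWED DISRUPTIONS column should now show at least 1, and the resource group and cluster name below are placeholders):

    $ kubectl get poddisruptionbudget -n <your_namespace>

    $ az aks upgrade --resource-group <resource-group> --name <cluster-name> --kubernetes-version 1.18.14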

    For details see https://learn.microsoft.com/en-us/azure/aks/operator-best-practices-scheduler#plan-for-availability-using-pod-disruption-budgets

    1 person found this answer helpful.