az aks upgrade to kubernetes version 1.18.14 is failing because of “Pod Disruption Budgets” (partially completed)

Colasanto, Francesca 26 Reputation points
2021-03-30T11:33:32.147+00:00

az aks upgrade --resource-group --name --kubernetes-version 1.18.14 (from 1.17.9)

is reporting the following error:

Deployment failed. Correlation ID: fa7565e0-a741-4ef2-accf-a76be59da209. Drain did not complete pods [nginx-ingress-ingress-nginx-controller-744847f7b8-kh7bc] on vm aks-agentpool-42415862-vmss000004. Check Pod Disruption Budgets

This is causing an inconsistent configuration (some nodes appear to be upgraded to 1.18.14 and some are not yet, e.g. aks-agentpool-42415862-vmss000004).

Any hints will be appreciated.

Thanks !

Azure Kubernetes Service (AKS)

Accepted answer
  1. shiva patpi 13,141 Reputation points Microsoft Employee
    2021-03-30T22:49:48.87+00:00

    Hello @Colasanto, Francesca,
    Thanks for your query!
    Based on your existing PodDisruptionBudget configuration, the AKS upgrade failure is expected. The node aks-agentpool-42415862-vmss000004 was not upgraded because the upgrade process was not able to move the pod nginx-ingress-ingress-nginx-controller-744847f7b8-kh7bc to another node (it failed to drain the node because of the pod's PDB).

    You are hitting the issue described in the links below:

    Take a look at a similar post

    Basics of PDB

    The effect of PDB

    How to configure PDB (best practices)

    (A detailed description is given in the articles above.)
    In short, if you look at your PDB, it says the minimum available must always be 1 (i.e. at least 1 pod must be available at all times). During the upgrade, AKS tries to drain each node, and as part of draining the node the pods are evicted and moved to another node. Since the PDB says the minimum available is 1, evicting the controller pod while no second replica is available would violate the PDB, so the eviction is refused and the drain cannot complete on that node. You can check the PDB's currently allowed disruptions as shown below.
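    For example, you can confirm this in the cluster (the namespace is an assumption here; add -n <your_namespace> if the ingress controller does not run in the default namespace):

    # Shows MIN AVAILABLE and ALLOWED DISRUPTIONS; 0 allowed disruptions means the node drain will be blocked
    kubectl get pdb nginx-ingress-ingress-nginx-controller

    # More detail, including which pods the PDB currently matches
    kubectl describe pdb nginx-ingress-ingress-nginx-controller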

    Basic rule while defining a PDB:
    Make sure the PDB still allows at least one disruption during the upgrade, i.e. minAvailable must be lower than the existing number of replicas (or maxUnavailable must be at least 1).

    Mitigation 1:
    Try deleting the PDB and then run the upgrade:
    kubectl delete pdb nginx-ingress-ingress-nginx-controller
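    A minimal sketch of that sequence, assuming you want the PDB back once the upgrade has finished (the resource group and cluster name are placeholders for the values in your original command):

    # Back up the current PDB so it can be restored later
    kubectl get pdb nginx-ingress-ingress-nginx-controller -o yaml > /tmp/nginx-pdb-backup.yaml

    # Delete the PDB so the drain can evict the ingress controller pod
    kubectl delete pdb nginx-ingress-ingress-nginx-controller

    # Re-run the upgrade on the remaining nodes
    az aks upgrade --resource-group <resource-group> --name <cluster-name> --kubernetes-version 1.18.14

    # Restore the PDB once all nodes are on 1.18.14
    kubectl apply -f /tmp/nginx-pdb-backup.yaml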

    Mitigation 2:
    Try increasing the number of replicas of the nginx-ingress-ingress-nginx-controller deployment in your deployment yaml file, so that one pod can be evicted while the PDB is still satisfied (see the sketch below).
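    For example (the deployment name is inferred from the pod name in the error message and may differ in your release; add -n <your_namespace> if needed):

    # With 2 replicas and minAvailable: 1, one pod can be evicted during the drain while the other keeps serving traffic
    kubectl scale deployment nginx-ingress-ingress-nginx-controller --replicas=2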

    Mitigation 3:
    Try changing the PDB itself so that it allows at least one disruption, for example by replacing minAvailable: 1 with maxUnavailable: 1 (see the sketch below and the sample PDB in the answer that follows).
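    One way to do that in place, as a sketch (this assumes the PDB currently uses minAvailable and that a JSON merge patch is acceptable; setting a field to null removes it in a merge patch):

    # Swap minAvailable: 1 for maxUnavailable: 1 so one pod may be evicted during the drain
    kubectl patch pdb nginx-ingress-ingress-nginx-controller --type=merge -p '{"spec":{"minAvailable":null,"maxUnavailable":1}}'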

    Hope the explanation above helps you understand and resolve the issue. If it helps, kindly upvote and accept the answer.

    2 people found this answer helpful.

1 additional answer

  1. COLASANTO, FRANCESCA 31 Reputation points
    2021-06-17T06:30:44.603+00:00

    PodDisruptionBudget yaml file sample:

    apiVersion: policy/v1beta1
    kind: PodDisruptionBudget
    metadata:
      name: nginx-pdb
      namespace: <your_namespace>
    spec:
      maxUnavailable: 1
      selector:
        matchLabels:
          app: nginx-frontend

    You can run the following in your kube context:

    $ kubectl get poddisruptionbudget -n <your_namespace>

    $ kubectl get poddisruptionbudget <poddisruptionbudgetname> -n <your_namespace> -o yaml > /tmp/poddisruptionbudget.yaml

    In /tmp/poddisruptionbudget.yaml set "maxUnavailable: 1", save, and exit.

    $ kubectl apply -f /tmp/poddisruptionbudget.yaml -n <your_namespace>
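    Before re-running the upgrade you can verify the change took effect (a quick sanity check; the ALLOWED DISRUPTIONS column should now show at least 1, and the resource group and cluster name below are placeholders):

    $ kubectl get poddisruptionbudget -n <your_namespace>

    $ az aks upgrade --resource-group <resource-group> --name <cluster-name> --kubernetes-version 1.18.14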

    For details see https://learn.microsoft.com/en-us/azure/aks/operator-best-practices-scheduler#plan-for-availability-using-pod-disruption-budgets

    1 person found this answer helpful.