Below are a couple of common reasons why an AKS cluster goes into a Failed state during an upgrade:
- Check whether there are any blocking Pod Disruption Budgets (PDBs).
- Strict PDBs can block node draining (i.e., moving pods to new nodes). By default, the upgrade keeps retrying the drain and eventually times out (90 minutes); that is when the cluster goes into a Failed state.
- Mitigation: temporarily delete the blocking PDBs, then re-apply them once the upgrade completes (see the first sketch after this list).
- Check whether you are using custom DNS servers or a firewall that blocks outbound connectivity.
- During an upgrade, newly bootstrapped nodes need outbound internet connectivity to reach endpoints such as mcr.microsoft.com and ubuntu.com for post-deployment tasks (see the second sketch after this list).
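A minimal sketch for finding and temporarily removing a blocking PDB (the namespace `my-namespace` and PDB name `my-pdb` are placeholders for your own values):

```
# List all PDBs; a PDB whose ALLOWED DISRUPTIONS column shows 0
# can block node drains during an upgrade.
kubectl get pdb --all-namespaces

# Save a copy of the blocking PDB before deleting it.
kubectl get pdb my-pdb -n my-namespace -o yaml > my-pdb-backup.yaml
kubectl delete pdb my-pdb -n my-namespace

# After the upgrade completes, re-apply the saved PDB.
# (You may need to strip server-generated fields such as
# resourceVersion, uid, and status from the backup first.)
kubectl apply -f my-pdb-backup.yaml
```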
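And a quick way to sanity-check DNS resolution and outbound reachability from inside the cluster (the pod name `net-test` and the busybox image are arbitrary choices):

```
# Run a throwaway pod and try to resolve mcr.microsoft.com;
# the pod is removed automatically once the command exits.
kubectl run net-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup mcr.microsoft.com
```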
---
How to check the logs (a combined CLI sketch follows the list below):
- Check the activity logs for the cluster in the Azure Portal.
- Validate the provisioning state of the Virtual Machine Scale Set (Failed/Succeeded).
- Check the status of the individual VMSS instances as well.
- Inspect cluster events with: kubectl get events -A
- Check how many nodes were upgraded: kubectl get nodes -o wide
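A sketch of these checks from the CLI (the node resource group `MC_myRG_myAKS_eastus` and the VMSS name are placeholders; look up the real names for your own cluster):

```
# Recent cluster events, sorted by time; drain or upgrade errors show here.
kubectl get events -A --sort-by=.metadata.creationTimestamp

# Compare node versions to see how many nodes have been upgraded.
kubectl get nodes -o wide

# Check the provisioning state of the node-pool VMSS and its instances
# in the node resource group (usually named MC_<rg>_<cluster>_<region>).
az vmss list --resource-group MC_myRG_myAKS_eastus -o table
az vmss list-instances \
  --resource-group MC_myRG_myAKS_eastus \
  --name aks-nodepool1-12345678-vmss \
  --query "[].{instance:name, state:provisioningState}" -o table
```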
---
Try re-running the reconciliation command:
az resource update --ids <aks-resource-id>
(While it runs, keep an eye on kubectl get events and the activity logs for the AKS cluster in the Azure Portal. A sketch follows below.)
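For example (the resource group `myRG` and cluster name `myAKS` are placeholders):

```
# Look up the AKS resource ID, then re-run reconciliation; this
# retries the last failed operation on the cluster.
AKS_ID=$(az aks show --resource-group myRG --name myAKS --query id -o tsv)
az resource update --ids "$AKS_ID"

# In a second terminal, watch events while reconciliation runs.
kubectl get events -A --watch
```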
---
If none of these resolve your issue, please share your AKS resource ID for additional troubleshooting.
Regards,
Shiva.