Below are a couple of common reasons why an AKS cluster goes into a Failed state during an upgrade:
- Check whether there are any blocking Pod Disruption Budgets (PDBs).
- Strict PDBs can block node draining (i.e., moving pods to new nodes). By default, the upgrade keeps retrying the drain and eventually times out (90 minutes); that is when the cluster goes into a Failed state.
- Mitigation: temporarily delete the blocking PDBs, then re-apply them once the upgrade completes (see the first sketch after this list).
- Check whether you are using custom DNS servers or a firewall that blocks outbound connectivity.
- During an upgrade, newly bootstrapped nodes need outbound internet connectivity to reach endpoints such as mcr.microsoft.com and ubuntu.com for post-deployment tasks (see the second sketch after this list).
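A minimal sketch for finding and temporarily removing a blocking PDB (the namespace `my-namespace` and PDB name `my-pdb` are placeholders for your own values):

```
# List all PDBs; a PDB whose ALLOWED DISRUPTIONS column shows 0
# can block node drains during an upgrade.
kubectl get pdb --all-namespaces

# Save a copy of the blocking PDB before deleting it.
kubectl get pdb my-pdb -n my-namespace -o yaml > my-pdb-backup.yaml
kubectl delete pdb my-pdb -n my-namespace

# After the upgrade completes, re-apply the saved PDB.
# (You may need to strip server-generated fields such as
# resourceVersion, uid, and status from the backup first.)
kubectl apply -f my-pdb-backup.yaml
```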
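And a quick way to sanity-check DNS resolution and outbound reachability from inside the cluster (the pod name `net-test` and the busybox image are arbitrary choices):

```
# Run a throwaway pod and try to resolve mcr.microsoft.com;
# the pod is removed automatically once the command exits.
kubectl run net-test --rm -it --restart=Never --image=busybox:1.36 -- \
  nslookup mcr.microsoft.com
```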
---
How to check the logs (a combined CLI sketch follows the list below):
- Check the activity logs for the cluster in the Azure Portal.
- Validate the provisioning state of the Virtual Machine Scale Set (Failed/Succeeded).
- Check the status of the individual VMSS instances as well.
- Inspect cluster events with: kubectl get events -A
- Check how many nodes were upgraded: kubectl get nodes -o wide
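A sketch of these checks from the CLI (the node resource group `MC_myRG_myAKS_eastus` and the VMSS name are placeholders; look up the real names for your own cluster):

```
# Recent cluster events, sorted by time; drain or upgrade errors show here.
kubectl get events -A --sort-by=.metadata.creationTimestamp

# Compare node versions to see how many nodes have been upgraded.
kubectl get nodes -o wide

# Check the provisioning state of the node-pool VMSS and its instances
# in the node resource group (usually named MC_<rg>_<cluster>_<region>).
az vmss list --resource-group MC_myRG_myAKS_eastus -o table
az vmss list-instances \
  --resource-group MC_myRG_myAKS_eastus \
  --name aks-nodepool1-12345678-vmss \
  --query "[].{instance:name, state:provisioningState}" -o table
```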
---
Try re-running the reconciliation command:
az resource update --ids <aks-resource-id>
(While it runs, keep an eye on kubectl get events and the activity logs for the AKS cluster in the Azure Portal. A sketch follows below.)
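For example (the resource group `myRG` and cluster name `myAKS` are placeholders):

```
# Look up the AKS resource ID, then re-run reconciliation; this
# retries the last failed operation on the cluster.
AKS_ID=$(az aks show --resource-group myRG --name myAKS --query id -o tsv)
az resource update --ids "$AKS_ID"

# In a second terminal, watch events while reconciliation runs.
kubectl get events -A --watch
```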
---
If none of these resolve your issue, please share your AKS resource ID for additional troubleshooting.
Regards,
Shiva.