AKS Cluster Status is Failed

Ramakrishna Mula 0 Reputation points
2023-11-15T15:15:09.8166667+00:00

Hi All,

While upgrading the cluster version, the cluster status went into a failed state and it has not come back to normal even after 24 hours.

I also tried to update the cluster from the command line, but it still did not return to normal:

az resource update --ids <aks-resource-id>

Can anyone suggest a fix?

Azure Kubernetes Service (AKS)

1 answer

  1. shiva patpi 13,251 Reputation points Microsoft Employee
    2023-11-15T17:34:12.67+00:00

    @Ramakrishna Mula ,

    Below are a couple of reasons why an AKS cluster can go into a Failed state during an upgrade:

    1. Check whether any Pod Disruption Budgets (PDBs) are blocking the upgrade.
      1. A strict PDB can prevent nodes from draining (that is, moving their pods to new nodes). By default the upgrade keeps retrying the drain until it times out (90 mins), and at that point the cluster goes into the Failed state.
        1. Mitigation: temporarily delete the PDBs, then re-apply them once the upgrade completes.
    2. Check whether custom DNS servers or firewalls are blocking outbound connectivity.
      1. During an upgrade, newly bootstrapped nodes need outbound internet connectivity so that they can reach mcr.microsoft.com or ubuntu.com to run additional post-deployment tasks.
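For the first cause above, a minimal sketch of how one might spot blocking PDBs and set them aside during the upgrade. The function names (`filter_blocking_pdbs`, `list_blocking_pdbs`, `backup_and_delete_pdb`) and the backup file naming are hypothetical, not part of the original answer. The sketch assumes that a PDB whose `.status.disruptionsAllowed` is 0 is the kind that can stall a node drain.

```shell
# Hypothetical helper: read "namespace name allowed-disruptions" rows on stdin
# and print namespace/name for every PDB that currently allows 0 disruptions.
filter_blocking_pdbs() {
  awk '$3 == 0 { print $1 "/" $2 }'
}

# Hypothetical helper: list blocking PDBs across all namespaces.
list_blocking_pdbs() {
  kubectl get pdb -A --no-headers \
    -o custom-columns='NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed' \
    | filter_blocking_pdbs
}

# Hypothetical helper: save a PDB to a local YAML file, then delete it.
# Usage: backup_and_delete_pdb <namespace> <name>
backup_and_delete_pdb() {
  kubectl get pdb "$2" -n "$1" -o yaml > "pdb-$1-$2.yaml" \
    && kubectl delete pdb "$2" -n "$1"
}
```

Running `list_blocking_pdbs` prints one `namespace/name` line per blocking PDB. The saved YAML includes server-set fields such as `metadata.resourceVersion`, which may need to be removed before re-applying; if you still have the original manifests, re-applying those after the upgrade is likely simpler.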

    ///////////////

    How to check the logs:

    1. Review the activity logs in the Azure Portal.
    2. Check the status of the Virtual Machine Scale Set (VMSS): Failed or Succeeded.
    3. Check the status of the individual VMSS instances.
    4. List cluster events: kubectl get events -A
    5. Check how many nodes were upgraded:
      1. kubectl get nodes -o wide
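The checks above can also be scripted. The following is a rough sketch rather than an official procedure. The function names are hypothetical, and it assumes that comparing kubelet versions across nodes is a reasonable way to see how far the upgrade got.

```shell
# Hypothetical helper: read "node-name kubelet-version" rows on stdin and
# print each version with the number of nodes running it.
count_nodes_by_version() {
  awk '{ count[$2]++ } END { for (v in count) print v, count[v] }' | sort
}

# Hypothetical helper: summarize the cluster's nodes by kubelet version.
nodes_by_version() {
  kubectl get nodes --no-headers \
    -o custom-columns='NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion' \
    | count_nodes_by_version
}

# Hypothetical helper: show the provisioning state of each scale set in the
# cluster's node resource group.
# Usage: vmss_states <resource-group> <cluster-name>
vmss_states() {
  node_rg=$(az aks show -g "$1" -n "$2" --query nodeResourceGroup -o tsv) || return 1
  az vmss list -g "$node_rg" \
    --query '[].{name:name, state:provisioningState}' -o table
}
```

If `nodes_by_version` reports two versions, the upgrade stopped partway through. To narrow the event list to problems, `kubectl get events -A --field-selector type=Warning` may also help.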

    /////////

    Try to re-run the reconciliation command:

    az resource update --ids <aks-resource-id>

    (While that command runs, keep an eye on kubectl get events and on the activity logs for the AKS cluster in the Azure Portal.)
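If you do not have the resource ID to hand, the reconciliation step can be wrapped as in the sketch below. The function name `reconcile_cluster` is hypothetical. It looks up the cluster's resource ID, runs the same `az resource update` command, and then prints the cluster's provisioning state.

```shell
# Hypothetical helper: look up the AKS resource ID, re-run the reconcile,
# and report the resulting provisioning state.
# Usage: reconcile_cluster <resource-group> <cluster-name>
reconcile_cluster() {
  aks_id=$(az aks show -g "$1" -n "$2" --query id -o tsv) || return 1
  az resource update --ids "$aks_id" >/dev/null || return 1
  az aks show -g "$1" -n "$2" --query provisioningState -o tsv
}
```

A printed state of `Succeeded` suggests the reconcile worked; `Failed` means the underlying cause (for example a blocking PDB or blocked outbound traffic) probably still needs to be fixed first.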

    ////

    If none of these resolve your issue, please share your AKS resource ID for additional troubleshooting.

    Regards,

    Shiva.
