AKS status: Failed (Running)

Peter 0 Reputation points
2024-05-04T14:47:28.2233333+00:00

After stopping and then starting AKS the status became Failed (Running).

I tried both the Azure portal and the az cli, but neither az aks update nor az aks operation-abort completes successfully.

I also tried kubectl:

kubectl get nodes

returns:

{"code":"KubernetesOperationError","message":"Failed to run command in managed cluster due to kubernetes failure. details: Get \"https://aks-....io:443/api/v1/namespaces/aks-command\": dial tcp: lookup aks-...io on X.Y.Z.W:53: no such host"}
Azure Kubernetes Service (AKS)

3 answers

  1. Sina Salam 7,441 Reputation points
    2024-05-04T23:01:02.2333333+00:00

    Hello Peter,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    Problem

    Following up on your question and the error you posted, I understand that your Azure Kubernetes Service (AKS) cluster shows the status "Failed (Running)" after you stopped and then restarted it. You have tried to fix this with the Azure portal, the Azure CLI (az), and kubectl, but none of them worked. Also, when you ran kubectl get nodes, it returned the KubernetesOperationError shown above.

    Scenario

    Peter ran into a problem with his AKS cluster after he stopped and then started it. The cluster status changed to "Failed (Running)". He tried to fix it using the Azure portal and command-line tools such as the Azure CLI and kubectl, but nothing worked. When he ran kubectl get nodes to inspect the cluster, he got a KubernetesOperationError saying that the AKS API endpoint could not be reached.

    Solution

    The solution below is based on the scenario and your question. The KubernetesOperationError indicates that a command could not be run in the managed cluster because of a Kubernetes failure, specifically a DNS resolution failure: the host name of the API server cannot be found ("no such host").
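A DNS failure of this kind can also be checked from your own machine. The following is a minimal sketch, not an official procedure. It assumes that the current kubeconfig context points at the affected cluster, that the cluster is not a private cluster (a private API server name would not resolve from outside its network anyway), and that getent is available (Linux). The function names are hypothetical helpers introduced for illustration.

```shell
# Extract the host name from an API server URL such as https://host:443/path
host_from_url() {
  local url="$1"
  url="${url#*://}"            # drop the scheme
  url="${url%%/*}"             # drop any path
  printf '%s\n' "${url%%:*}"   # drop the port
}

# Report whether a host name resolves from this machine.
# Assumption: getent is available; nslookup or dig could be used instead.
api_server_resolves() {
  if getent hosts "$1" >/dev/null 2>&1; then
    echo "resolves"
  else
    echo "no such host"
  fi
}

# Read the API server URL of the current kubeconfig context.
current_api_server_url() {
  kubectl config view --minify --output 'jsonpath={.clusters[0].cluster.server}'
}
```

A possible invocation is api_server_resolves "$(host_from_url "$(current_api_server_url)")". If it prints "no such host", the API server name is not published in DNS, which would be consistent with a control plane that did not come back after the start operation.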

    To resolve the "Failed (Running)" status after a stop and start operation, first work through the checks below to narrow down the cause:

    Confirm the root cause

    • Use the Azure portal to check the health status of your AKS cluster and related resources. Resource Health can provide insights into any ongoing issues affecting your cluster. The error message suggests a DNS lookup failure, which could be caused by misconfigured DNS settings or services. Run kubectl get svc -n kube-system to list the services in the kube-system namespace, including DNS, and check that the DNS service is running and reports no resolution errors.
    • Access the logs for your AKS cluster to identify any errors or warnings that might provide clues about the issue. Look for any recent events or anomalies that occurred during the stop and start operation.
    • Search the pod logs for access errors that point to service principal issues with kubectl logs <pod-name> -n <namespace-name>. If specific Azure policies are in place, az aks command invoke can fail because of a disallowed configuration; run the command to confirm.
    • Use the Azure CLI to retrieve detailed status information about your AKS cluster. Run the command az aks show --resource-group <resource-group-name> --name <aks-cluster-name> to get information such as the current state, node count, and Kubernetes version.
    • Use the Azure CLI to retrieve events associated with the AKS cluster. Run az aks get-events --resource-group <resource-group-name> --name <aks-cluster-name> to get information about events that may have occurred during the stop and start operation.
    • Run kubectl cluster-info to validate that you can access the Kubernetes API server. Ensure that the API server URL is correct and accessible.
    • Run kubectl get pods --all-namespaces to list all pods running in the cluster. Look for any pods that are in a pending or failed state, as they may indicate issues with critical components of the cluster.
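The az aks show check from the list above can be narrowed to the fields that matter here. This is a minimal sketch: aks_state is a hypothetical helper name, and the --query expression assumes the provisioningState, powerState.code, and fqdn properties of the az aks show output.

```shell
# Print the provisioning state, power state, and API server FQDN of a cluster.
# Usage: aks_state <resource-group-name> <aks-cluster-name>
aks_state() {
  local rg="$1" name="$2"
  az aks show \
    --resource-group "$rg" \
    --name "$name" \
    --query "{provisioning:provisioningState, power:powerState.code, fqdn:fqdn}" \
    --output json
}
```

For a cluster in the state described in this thread, the expectation is a provisioning state of Failed together with a power state of Running.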

    Your feedback is needed to confirm whether the issue is DNS-related or not.

    Finally

    In Azure Portal:

    1. Navigate to your AKS cluster in the Azure portal.
    2. Click on Diagnose and solve problems in the left navigation.
    3. Choose a category that best describes your issue.

    References

    Source: Check for Resource Health events impacting your AKS cluster. Accessed, 5/4/2024.

    Source: Azure Kubernetes Service Diagnose and Solve Problems. Accessed, 5/4/2024.

    I hope this is helpful! Do not hesitate to let me know if you have any other questions.

    Please don't forget to close the thread by upvoting and accepting this as an answer if it is helpful, so that others in the community facing similar issues can easily find the solution.

    Best Regards,

    Sina Salam NR.


  2. Peter 0 Reputation points
    2024-05-05T10:32:16.6066667+00:00

    Thank you for the reply.

    Yes, you have captured it correctly. The AKS Stop action finished without issues and then the Start command failed. Now the AKS status is Failed (Running).

    None of the kubectl commands work. They complain about the error I posted in my previous post.

    By the way: az aks get-events (as suggested) does not seem to be an az cli command. Is it perhaps a kubectl command instead?

    There are 2 node pools, and both show the same statuses:

    • Provisioning state: Failed
    • Power state: Running
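The same node pool states can be read from the command line instead of the portal. A minimal sketch, assuming the provisioningState and powerState.code properties of the az aks nodepool list output; aks_nodepool_states is a hypothetical helper name.

```shell
# List every node pool of a cluster with its provisioning and power state.
# Usage: aks_nodepool_states <resource-group-name> <aks-cluster-name>
aks_nodepool_states() {
  local rg="$1" name="$2"
  az aks nodepool list \
    --resource-group "$rg" \
    --cluster-name "$name" \
    --query "[].{name:name, provisioning:provisioningState, power:powerState.code}" \
    --output table
}
```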

    Just in case, I also tried Update image and Scale node pool; they fail as well.

    Another note: in the Monitoring | Insights section there is a "Monitor Settings" button. In its dialog, the Log Analytics workspace points to DefaultWorkspace..., but this resource cannot be found. I created a new workspace, but I have not found a way to change the setting in the Azure portal, and changing it with the az cli fails. Enabling or disabling addons fails with:

    (KubernetesAPICallFailed) API call to Kubernetes API Server failed.

    Code: KubernetesAPICallFailed

    Message: API call to Kubernetes API Server failed.


  3. Peter 0 Reputation points
    2024-05-06T09:46:47.62+00:00

    The issue originated from the Log Analytics workspace; it disappeared for reasons unclear to me, possibly after a stop action was followed by an unsuccessful start.

    The same Log Analytics workspace had to be recreated. Attaching a different, newly created workspace was not possible because both the az cli and the REST API fail while the AKS cluster status is 'Failed (Running)'. It is also recommended to wait approximately 30 minutes after re-creating the workspace before attempting to start the AKS cluster again. The az cli was used to start the cluster.
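The recovery steps described above can be sketched with the az cli as follows. This is an illustration under assumptions, not the exact commands that were run: it assumes that "the same workspace" means a workspace with the same name in the same resource group as the one the cluster still references, and that the monitoring addon profile key is omsagent (some API versions report it as omsAgent). All function names are hypothetical helpers.

```shell
# Print the workspace resource ID that the monitoring addon still references.
# Assumption: the addon profile key is "omsagent" (it may appear as "omsAgent").
aks_workspace_id() {
  local rg="$1" name="$2"
  az aks show --resource-group "$rg" --name "$name" \
    --query "addonProfiles.omsagent.config.logAnalyticsWorkspaceResourceID" \
    --output tsv
}

# Take the workspace name from a resource ID (the last path segment).
workspace_name_from_id() {
  printf '%s\n' "${1##*/}"
}

# Recreate the workspace, wait, then start the cluster.
# Usage: recreate_workspace_and_start <cluster-rg> <cluster-name> \
#          <workspace-rg> <workspace-name> [wait-seconds]
recreate_workspace_and_start() {
  local rg="$1" name="$2" ws_rg="$3" ws_name="$4" wait_s="${5:-1800}"
  az monitor log-analytics workspace create \
    --resource-group "$ws_rg" --workspace-name "$ws_name" || return 1
  sleep "$wait_s"   # default 1800 s, the roughly 30 minutes mentioned above
  az aks start --resource-group "$rg" --name "$name"
}
```

The workspace resource group and name would be read from the ID printed by aks_workspace_id, so that the recreated workspace matches what the cluster expects.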
