Hello Peter,
Welcome to the Microsoft Q&A and thank you for posting your questions here.
Problem
Following your question and the error you posted, I understand that your AKS cluster is showing the status "Failed (Running)". After you stopped and then restarted your Azure Kubernetes Service (AKS) cluster, it ended up in this state. You have tried to fix the problem using the Azure portal, the Azure CLI (az), and kubectl, but none of these worked. Also, when you ran kubectl get nodes, it returned a Kubernetes operation error, as shown above.
Scenario
Peter ran into a problem with his AKS cluster after he stopped and then started it. The cluster status showed "Failed (Running)". He tried to fix it using the Azure portal and command-line tools such as the Azure CLI and kubectl, but nothing worked. When he ran kubectl get nodes to inspect the cluster, he got a KubernetesOperationError saying that the AKS API endpoint could not be reached.
Solution
This solution is based on the scenario and questions above and focuses on the problem statement. The KubernetesOperationError indicates that a command failed to run in the managed cluster because of a Kubernetes failure, specifically a DNS resolution issue: the host could not be found.
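Because the error points to a host lookup failure, one quick check is whether the API server host name resolves from the machine where you run kubectl. The following is only a minimal sketch, assuming bash, a kubeconfig whose current context is the affected cluster, and that nslookup is installed. The helper names api_host_from_url and check_api_dns are hypothetical and not part of any Azure tooling.

```shell
#!/usr/bin/env bash

# Hypothetical helper: reduce an API server URL to its bare host name
# by stripping the scheme, any path, and the port.
api_host_from_url() {
  local url="$1"
  local host="${url#*://}"   # drop a leading "https://" if present
  host="${host%%/*}"         # drop any path after the host
  host="${host%%:*}"         # drop a trailing ":443" style port
  printf '%s\n' "$host"
}

# Hypothetical helper: read the API server URL of the current kubeconfig
# context and try to resolve its host name.
check_api_dns() {
  local server host
  server="$(kubectl config view --minify -o jsonpath='{.clusters[0].cluster.server}')"
  host="$(api_host_from_url "$server")"
  echo "Resolving $host"
  nslookup "$host"
}
```

After sourcing the script, run check_api_dns. If the lookup fails, the problem is likely on the DNS side rather than inside the cluster itself.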
To resolve the "Failed (Running)" status of the AKS cluster after a stop and start operation, there are a couple of things to do, which also address the related questions raised by your scenario:
Confirm the root cause
- Use the Azure portal to check the health status of your AKS cluster and related resources. Resource Health can provide insights into any ongoing issues affecting your cluster.
- The error message suggests a DNS lookup failure, which could be due to misconfigured DNS settings or services. Run
kubectl get svc -n kube-system
to list the services in the kube-system namespace, including DNS. Ensure that the DNS service is present and that there are no errors related to DNS resolution.
- Access the logs for your AKS cluster to identify any errors or warnings that might provide clues about the issue. Look for any recent events or anomalies that occurred during the stop and start operation.
- Search the pod logs for access errors that point to service principal issues with
kubectl logs <pod-name> -n <namespace-name>
If specific Azure policies apply to your environment, az aks command invoke can fail because of a disallowed configuration. Run the command above to confirm.
- Use the Azure CLI to retrieve detailed status information about your AKS cluster. Run
az aks show --resource-group <resource-group-name> --name <aks-cluster-name>
to get information such as the current state, node count, and Kubernetes version.
- Use the Azure CLI to retrieve events associated with the AKS cluster. Run
az monitor activity-log list --resource-group <resource-group-name> --offset 1d
to list the activity log entries for the cluster's resource group from the last day, including any recorded during the stop and start operation.
- Run
kubectl cluster-info
to validate that you can access the Kubernetes API server. Ensure that the API server URL is correct and accessible.
- Run
kubectl get pods --all-namespaces
to list all pods running in the cluster. Look for any pods in a Pending or Failed state, as they may indicate issues with critical components of the cluster.
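The checks above can be collected in one pass so that a single failing command does not stop the rest. This is only a sketch, assuming bash and that az and kubectl are installed and logged in. The helper names run_check and collect_aks_diagnostics are hypothetical, and the field names in the --query expression (provisioningState, powerState.code) are my assumption about the az aks show JSON output, so verify them against your own output.

```shell
#!/usr/bin/env bash

# Hypothetical helper: run one diagnostic command, label its output,
# and report success or failure without aborting the script.
run_check() {
  local label="$1"; shift
  echo "== $label =="
  if "$@"; then
    echo "-- $label: ok"
  else
    # In the else branch, $? still holds the exit status of the command.
    echo "-- $label: FAILED (exit $?)"
  fi
}

# Hypothetical wrapper around the commands listed in this answer.
# Arguments: <resource-group-name> <aks-cluster-name>
collect_aks_diagnostics() {
  local rg="$1" name="$2"
  run_check "cluster state" az aks show --resource-group "$rg" --name "$name" \
    --query "{provisioningState: provisioningState, powerState: powerState.code}" \
    --output table
  run_check "API server" kubectl cluster-info
  run_check "nodes" kubectl get nodes
  run_check "kube-system services" kubectl get svc -n kube-system
  run_check "all pods" kubectl get pods --all-namespaces
}
```

After sourcing the script, call collect_aks_diagnostics with your resource group and cluster name, and share the output here. The lines marked FAILED show where to look first.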
Please share the results of these checks so we can confirm whether or not DNS is the cause.
Finally
In the Azure portal:
- Navigate to your AKS cluster in the Azure portal.
- Click on Diagnose and solve problems in the left navigation.
- Choose a category that best describes your issue.
References
Source: Check for Resource Health events impacting your AKS cluster. Accessed 5/4/2024.
Source: Azure Kubernetes Service Diagnose and Solve Problems. Accessed 5/4/2024.
I hope this is helpful! Do not hesitate to let me know if you have any other questions.
Please don't forget to close the thread by upvoting and accepting this as an answer if it is helpful, so that others in the community facing similar issues can easily find the solution.
Best Regards,
Sina Salam NR.