AKS Cluster in failed state (running)

Nitin Kumar 0 Reputation points
2023-08-19T13:15:08.7533333+00:00

Dear community,

Our prod AKS cluster is in a failed state (running). The worker nodes are also in the NotReady state.
Can you please help resolve this issue? We cannot even stop the cluster because it is stuck in the failed state. We tried running the following commands, but they failed after running for hours:

az resource update --ids /subscriptions/xxx/resourcegroups/xxx/providers/Microsoft.ContainerService/ManagedClusters/xxx

az aks update --resource-group xxx --name xxx

I also see the following error:

  Warning  NetworkNotReady  2m34s (x14701 over 8h)  kubelet  network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized

Below is some output from kubectl commands:

[nlweb-prod@AZSAPLPPUSA01 tmp]$ kubectl get no -o wide

E0819 13:10:27.880180   25142 memcache.go:255] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0819 13:10:27.896518   25142 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0819 13:10:27.900139   25142 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0819 13:10:27.904477   25142 memcache.go:106] couldn't get resource list for metrics.k8s.io/v1beta1: the server is currently unable to handle the request
NAME                                  STATUS     ROLES   AGE   VERSION    INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
aks-primarypool-19920242-vmss000006   NotReady   agent   10h   v1.25.11   10.82.46.4    
Azure Kubernetes Service (AKS)

2 answers

  1. AirGordon 7,145 Reputation points
    2023-08-20T06:51:30.5966667+00:00

    You seem to be running a single node, and it has failed.

    As a first step, I'd suggest rebooting the node. You can do this from the VMSS view in the Azure Portal or, more lazily, by stopping and starting the AKS cluster from the AKS resource in the portal. Hopefully this fixes the initial issue of the node not being ready. If it doesn't, my suggestion would be to raise a case with Azure support.
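
For anyone who prefers the CLI to the portal, the reboot and stop/start steps could be sketched roughly as below. This is only an illustration, not a verified fix for this cluster: `RESOURCE_GROUP` and `CLUSTER_NAME` are placeholders (the thread redacts them as `xxx`), and `run`, `show_state`, `restart_nodes`, `restart_cluster` and `DRY_RUN` are hypothetical helper names introduced here. By default the script only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Placeholders: the thread redacts the real names as "xxx".
RESOURCE_GROUP="${RESOURCE_GROUP:-xxx}"
CLUSTER_NAME="${CLUSTER_NAME:-xxx}"
# Assumption: dry-run by default, so nothing touches Azure unless DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"

# run: print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# show_state: report the provisioning state (e.g. Failed) and power state (e.g. Running).
show_state() {
  run az aks show --resource-group "$RESOURCE_GROUP" --name "$CLUSTER_NAME" \
    --query "{provisioningState:provisioningState, powerState:powerState.code}" -o table
}

# restart_nodes: restart the scale set behind a node pool.
# $1 = node resource group, $2 = scale set name (both must be looked up first).
restart_nodes() {
  run az vmss restart --resource-group "$1" --name "$2"
}

# restart_cluster: the "lazier" route, stop and then start the whole cluster.
restart_cluster() {
  run az aks stop --resource-group "$RESOURCE_GROUP" --name "$CLUSTER_NAME"
  run az aks start --resource-group "$RESOURCE_GROUP" --name "$CLUSTER_NAME"
}

show_state
restart_cluster
show_state
```

The node resource group needed by `restart_nodes` can be read with `az aks show --resource-group xxx --name xxx --query nodeResourceGroup -o tsv`. The scale set name is probably the node name without its instance suffix (here that would be `aks-primarypool-19920242-vmss`, inferred from the `kubectl get no` output rather than confirmed). Note that the original poster reports the stop operation itself failing, so the stop/start route may not work while the cluster is stuck in the failed state.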

    I'd also suggest verifying your cluster implementation against current design best practices. https://learn.microsoft.com/en-us/azure/architecture/reference-architectures/containers/aks/baseline-aks


  2. Nitin Kumar 0 Reputation points
    2023-08-21T09:04:12.21+00:00

    I tried rebooting and adding other nodes to the node pool. I even tried adding another system node pool and a user node pool. The nodes are always in the NotReady state. When describing the nodes, I see the following error:

    Warning NetworkNotReady 2m34s (x14701 over 8h) kubelet network is not ready: container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: cni plugin not initialized
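
To gather more detail for a support case, the NotReady nodes and the kube-system pods scheduled on them could be collected along these lines. This is a rough sketch: `not_ready_nodes`, `inspect_node` and the `INSPECT_CLUSTER` switch are hypothetical names introduced here, and it assumes the networking pods for this cluster live in the `kube-system` namespace.

```shell
#!/usr/bin/env bash
# not_ready_nodes: read `kubectl get nodes --no-headers` output on stdin and
# print the names of nodes whose STATUS column is not exactly "Ready"
# (this also flags values such as "Ready,SchedulingDisabled").
not_ready_nodes() {
  awk '$2 != "Ready" { print $1 }'
}

# inspect_node: show the node's conditions and events, then the kube-system
# pods scheduled on that node.
inspect_node() {
  local node="$1"
  kubectl describe node "$node"
  kubectl get pods -n kube-system -o wide --field-selector "spec.nodeName=$node"
}

# Assumption: only query the cluster when explicitly asked to (INSPECT_CLUSTER=1).
if [ "${INSPECT_CLUSTER:-0}" = "1" ]; then
  kubectl get nodes --no-headers | not_ready_nodes | while read -r node; do
    inspect_node "$node"
  done
fi
```

Pods on the affected node that are stuck in states such as Pending or CrashLoopBackOff would be worth including in the support request alongside the `NetworkPluginNotReady` event.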

