AKS metrics-server fails to connect to API server

Arundeep Singh 0 Reputation points
2023-02-17T13:27:09.3933333+00:00

After upgrading to AKS 1.25.5, the cluster is in a failed state. Pods cannot reach in-cluster services by name, and outbound connections (e.g. to GitHub) also fail. The cause appears to be that metrics-server cannot connect to the API server.

metrics-server-vpa ERROR: logging before flag.Parse: I0216 23:09:50.341577       1 pod_nanny.go:69] Version: 1.8.14
metrics-server-vpa ERROR: logging before flag.Parse: I0216 23:09:50.341601       1 pod_nanny.go:85] Watching namespace: kube-system, pod: metrics-server-6fcdf95df7-49f6k, container: metrics-server.
metrics-server-vpa ERROR: logging before flag.Parse: I0216 23:09:50.341608       1 pod_nanny.go:86] storage: MISSING, extra_storage: 0Gi
metrics-server-vpa ERROR: logging before flag.Parse: I0216 23:09:50.440128       1 pod_nanny.go:189] Failed to read data from config file "/etc/config/NannyConfiguration": open /etc/config/NannyConfiguration: no such file or directory, using default parameters
metrics-server-vpa ERROR: logging before flag.Parse: I0216 23:09:50.440194       1 pod_nanny.go:116] cpu: 44m, extra_cpu: 0.5m, memory: 51Mi, extra_memory: 4Mi
metrics-server-vpa ERROR: logging before flag.Parse: I0216 23:09:50.440248       1 pod_nanny.go:145] Resources: [{Base:{i:{value:44 scale:-3}d:{Dec:<nil>} s:44m Format:DecimalSI} ExtraPerNode:{i:{value:5 scale:-4} d:{Dec:<nil>} s: Format:DecimalSI} Name:cpu} {Base:{i:{value:53477376 scale:0} d:{Dec:<nil>} s:51Mi Format:BinarySI} ExtraPerNode:{i:{value:4194304 scale:0} d:{Dec:<nil>} s:4Mi Format:BinarySI} Name:memory}]
metrics-server-vpa ERROR: logging before flag.Parse: E0216 23:10:20.447278       1 nanny_lib.go:128] Get "https://production-dns-23c01f45.hcp.westeurope.azmk8s.io:443/api/v1/nodes?resourceVersion=0": dial tcp: i/o timeout


metrics-server Error: unable to load configmap based request-header-client-ca-file: Get "https://production-dns-23c01f45.hcp.westeurope.azmk8s.io:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication": dial tcp: i/o timeout

At times it can resolve the IP address for that hostname, but the request still ends in a timeout.

metrics-server-vpa ERROR: logging before flag.Parse: E0216 23:38:00.574679       1 nanny_lib.go:128] Get "https://production-dns-23c01f45.hcp.westeurope.azmk8s.io:443/api/v1/nodes?resourceVersion=0": dial tcp: lookup production-dns-23c01f45.hcp.westeurope.azmk8s.io on 10.0.0.10:53: read udp 10.244.9.6:45415->10.0.0.10:53: i/o timeout
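
The log line above shows the lookup against the cluster DNS service at 10.0.0.10:53 timing out. A minimal sketch of how that could be checked from inside the cluster follows. The function names, the `dns-check` pod name, and the `busybox:1.36` image are assumptions for illustration, as is the expectation that the CoreDNS pods carry the `k8s-app=kube-dns` label. The hostname is the one from the logs.

```shell
# Sketch only: helper names, pod name, and image tag are hypothetical.

# API server hostname taken from the log lines above.
API_FQDN="production-dns-23c01f45.hcp.westeurope.azmk8s.io"

# Show the cluster DNS service and the DNS pods behind it.
show_cluster_dns() {
  kubectl get service kube-dns --namespace kube-system
  kubectl get pods --namespace kube-system --selector k8s-app=kube-dns --output wide
}

# Resolve a hostname from a throwaway pod so the lookup goes through
# cluster DNS rather than the laptop's resolver. Defaults to API_FQDN.
resolve_from_pod() {
  kubectl run dns-check --rm -i --restart=Never --image=busybox:1.36 -- nslookup "${1:-$API_FQDN}"
}
```

Running `show_cluster_dns` and then `resolve_from_pod` would indicate whether the DNS pods are running and whether a fresh pod can resolve the API server name at all.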

This in turn causes problems for konnectivity-agent. The IP address in the log line below belongs to one of the metrics-server pods.

konnectivity-agent-7d7c4fdddc-sdcrg E0217 13:24:04.598863       1 client.go:447] "error dialing backend" err="dial tcp 10.244.9.6:4443: i/o timeout" dialID=8845197344407888474 
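
To map a backend address from such a log line to the pod that owns it, something like the following sketch could be used. Both function names are hypothetical. The `sed` pattern assumes the log format shown above.

```shell
# Sketch only: function names are hypothetical.

# Read konnectivity-agent log lines on stdin and print the "ip:port" of
# each backend whose dial ended in an i/o timeout.
backend_from_log() {
  sed -n 's/.*dial tcp \([0-9.]*:[0-9]*\): i\/o timeout.*/\1/p'
}

# List the pod that currently owns the given IP, in any namespace.
find_pod_by_ip() {
  kubectl get pods --all-namespaces --output wide --field-selector "status.podIP=$1"
}
```

For example, piping the agent's logs through `backend_from_log` and then calling `find_pod_by_ip 10.244.9.6` would show which pod the agent failed to reach.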

Currently the complete cluster is unusable. I connect to the cluster from my laptop with helm, kubectl, etc. It seems that the AKS internal network has an issue.

Azure Kubernetes Service

2 answers

  1. Eddie Neto 1,251 Reputation points Microsoft Employee
    2023-02-27T09:32:05.65+00:00

    Hi @Arundeep Singh

    Thanks for reaching out on Microsoft Q&A.

    Regarding your issue above, could you please confirm whether you have the Azure Monitor add-on for AKS enabled? If so, please disable it and enable it again.

    Disabling

    az aks disable-addons --addons azure-policy --name MyAKSCluster --resource-group MyResourceGroup

    Enabling

    az aks enable-addons --addons azure-policy --name MyAKSCluster --resource-group MyResourceGroup

    Note that some limitations apply to the Azure Policy add-on.

    Hope this helps. Please "Accept as Answer" if it helped, so that it can help others in the community looking for help on similar topics.

    1 person found this answer helpful.

  2. Naninga Karunaratne 0 Reputation points
    2023-02-17T21:08:28.48+00:00

    Hello there!

    Sorry to hear that you are having issues with your cluster. Which version did you upgrade from to 1.25.5? Also, to troubleshoot the issue, you can try the following steps:

    1. Check the cluster logs for any error messages.
    2. Verify that the control plane components are running and healthy.
    3. Confirm that the nodes are healthy and have enough resources.
    4. Ensure that the network connectivity between the nodes and the control plane is working.
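
    As a rough sketch, the four steps above could map to commands like the ones below. The function names are hypothetical, `MyAKSCluster` and `MyResourceGroup` are placeholders, and the `k8s-app=metrics-server` label is an assumption about how the pods are labelled in this cluster.

    ```shell
    # Sketch only: function names are hypothetical; MyAKSCluster and
    # MyResourceGroup are placeholders for the real cluster and group.

    # Step 1: recent logs from the metrics-server pods (all containers).
    check_logs() {
      kubectl logs --namespace kube-system --selector k8s-app=metrics-server --all-containers --tail=50
    }

    # Step 2: API server readiness as seen from the client, plus the
    # cluster state and Kubernetes version reported by Azure.
    check_control_plane() {
      kubectl get --raw '/readyz?verbose'
      az aks show --resource-group MyResourceGroup --name MyAKSCluster \
        --query '{state: provisioningState, version: kubernetesVersion}' --output table
    }

    # Step 3: node status, then per-node conditions and allocated resources.
    check_nodes() {
      kubectl get nodes --output wide
      kubectl describe nodes
    }

    # Step 4: system pods, including those that carry traffic between the
    # nodes and the control plane, with their nodes and IPs.
    check_network() {
      kubectl get pods --namespace kube-system --output wide
    }
    ```

    Each function can be run on its own, so the checks can be done in whatever order the symptoms suggest.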

    I hope this helps you resolve the issue.

    Please "Accept as Answer" and Upvote if it helped, so that it can help others in the community looking for help on similar topics. Thank you!

