AKS metrics-server fails to connect to API server

Question

Arundeep Singh 0

After upgrading to AKS 1.25.5, the cluster is in a failed state. Pods cannot reach in-cluster services by name, and outbound connections (e.g. to GitHub) fail as well. The cause appears to be that metrics-server cannot connect to the API server.

metrics-server-vpa ERROR: logging before flag.Parse: I0216 23:09:50.341577       1 pod_nanny.go:69] Version: 1.8.14
metrics-server-vpa ERROR: logging before flag.Parse: I0216 23:09:50.341601       1 pod_nanny.go:85] Watching namespace: kube-system, pod: metrics-server-6fcdf95df7-49f6k, container: metrics-server.
metrics-server-vpa ERROR: logging before flag.Parse: I0216 23:09:50.341608       1 pod_nanny.go:86] storage: MISSING, extra_storage: 0Gi
metrics-server-vpa ERROR: logging before flag.Parse: I0216 23:09:50.440128       1 pod_nanny.go:189] Failed to read data from config file "/etc/config/NannyConfiguration": open /etc/config/NannyConfiguration: no such file or directory, using default parameters
metrics-server-vpa ERROR: logging before flag.Parse: I0216 23:09:50.440194       1 pod_nanny.go:116] cpu: 44m, extra_cpu: 0.5m, memory: 51Mi, extra_memory: 4Mi
metrics-server-vpa ERROR: logging before flag.Parse: I0216 23:09:50.440248       1 pod_nanny.go:145] Resources: [{Base:{i:{value:44 scale:-3}d:{Dec:<nil>} s:44m Format:DecimalSI} ExtraPerNode:{i:{value:5 scale:-4} d:{Dec:<nil>} s: Format:DecimalSI} Name:cpu} {Base:{i:{value:53477376 scale:0} d:{Dec:<nil>} s:51Mi Format:BinarySI} ExtraPerNode:{i:{value:4194304 scale:0} d:{Dec:<nil>} s:4Mi Format:BinarySI} Name:memory}]
metrics-server-vpa ERROR: logging before flag.Parse: E0216 23:10:20.447278       1 nanny_lib.go:128] Get "https://production-dns-23c01f45.hcp.westeurope.azmk8s.io:443/api/v1/nodes?resourceVersion=0": dial tcp: i/o timeout


metrics-server Error: unable to load configmap based request-header-client-ca-file: Get "https://production-dns-23c01f45.hcp.westeurope.azmk8s.io:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication": dial tcp: i/o timeout

At times it can resolve the IP for that hostname, but the request still ends in a timeout.

metrics-server-vpa ERROR: logging before flag.Parse: E0216 23:38:00.574679       1 nanny_lib.go:128] Get "https://production-dns-23c01f45.hcp.westeurope.azmk8s.io:443/api/v1/nodes?resourceVersion=0": dial tcp: lookup production-dns-23c01f45.hcp.westeurope.azmk8s.io on 10.0.0.10:53: read udp 10.244.9.6:45415->10.0.0.10:53: i/o timeout

This in turn causes problems for konnectivity-agent. The IP in the log line below belongs to one of the metrics-server pods.

konnectivity-agent-7d7c4fdddc-sdcrg E0217 13:24:04.598863       1 client.go:447] "error dialing backend" err="dial tcp 10.244.9.6:4443: i/o timeout" dialID=8845197344407888474

Currently the complete cluster is unusable. I connect to the cluster from my laptop with helm, kubectl, etc. It seems the AKS internal network has some issue.

KarishmaTiwari-MSFT 20,772 Reputation points Microsoft Employee Moderator

2023-02-27T07:24:15.0333333+00:00

Hi Arundeep Singh ,

Checking in to see if the answer to your follow up question below helped? If you are still seeing issues, we would need a support engineer with access to the backend in order to troubleshoot the issue.

Please open a support request if you have the ability to do so. If not, please send an email to 'AzCommunity@microsoft.com' with the subject - Attn: Karishma. Provide your Subscription Id as well as a link to this thread in the email body. I will enable a one time support request for you and share instructions over the email.

Thanks.

Answer 1

Eddie Neto 1,251 Microsoft Employee

Hi @Arundeep Singh

Thanks for reaching out on Microsoft Q&A.

Regarding your issue above, could you please confirm whether you have the Azure Monitor add-on for AKS enabled? If so, please disable it and enable it again.

Disabling

az aks disable-addons --addons azure-policy --name MyAKSCluster --resource-group MyResourceGroup

Enabling

az aks enable-addons --addons azure-policy --name MyAKSCluster --resource-group MyResourceGroup

Note that some limitations apply to the Azure Policy add-on.
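
A hedged sketch of checking which add-ons are enabled before toggling one. The commands above use the add-on name azure-policy; in the Azure CLI the Azure Monitor (Container insights) add-on is named monitoring, so the name passed in should match whichever add-on is meant. The helper names show_addons and toggle_addon are hypothetical, and MyResourceGroup / MyAKSCluster are the placeholders from the commands above.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; nothing runs until a function is called.
# Assumes the Azure CLI is installed and logged in (az login).

# List the add-on profiles of a cluster to see what is currently enabled.
# $1 = resource group, $2 = cluster name
show_addons() {
  az aks show --resource-group "$1" --name "$2" --query addonProfiles --output json
}

# Disable an add-on and, only if that succeeded, enable it again.
# $1 = add-on name (e.g. azure-policy or monitoring), $2 = resource group, $3 = cluster name
toggle_addon() {
  az aks disable-addons --addons "$1" --resource-group "$2" --name "$3" &&
    az aks enable-addons --addons "$1" --resource-group "$2" --name "$3"
}
```

Example usage: show_addons MyResourceGroup MyAKSCluster, followed by toggle_addon azure-policy MyResourceGroup MyAKSCluster if the add-on is listed as enabled.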

Hope this helps. Please "Accept as Answer" if it helped, so that it can help others in the community looking for help on similar topics.

Eddie Neto 1,251 Microsoft Employee

Hi @Arundeep Singh

Please follow the mitigation below.

Create a metrics-server-config ConfigMap in the kube-system namespace to override the metrics-server resource limits:



apiVersion: v1
kind: ConfigMap
metadata:
  name: metrics-server-config
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  NannyConfiguration: |-
    apiVersion: nannyconfig/v1alpha1
    kind: NannyConfiguration
    baseCPU: 80m
    cpuPerNode: 5m
    baseMemory: 80Mi
    memoryPerNode: 8Mi
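
A rough sketch of what this override means and how it could be applied. It assumes the resizer sidecar (the metrics-server-vpa container in the logs from the question) uses a linear estimate, that is, base plus the per-node increment times the node count; the actual sidecar may apply additional thresholds. The helper names nanny_estimate and apply_metrics_server_config and the file name metrics-server-config.yaml are hypothetical, and the Deployment name metrics-server is inferred from the pod name in the question.

```shell
#!/bin/sh
# Sketch only: helper names and the file name are hypothetical.

# Linear estimate: base + per-node increment * node count (same unit for all values).
# $1 = base, $2 = per-node increment, $3 = node count
nanny_estimate() {
  echo $(( $1 + $2 * $3 ))
}

# Apply the ConfigMap saved to a file, then restart metrics-server so it is picked up.
# $1 = path to the ConfigMap manifest (default: metrics-server-config.yaml)
apply_metrics_server_config() {
  kubectl apply -f "${1:-metrics-server-config.yaml}" &&
    kubectl -n kube-system rollout restart deployment metrics-server &&
    kubectl -n kube-system get pods -l k8s-app=metrics-server
}

# With the values above on a 10-node cluster:
nanny_estimate 80 5 10   # CPU in millicores -> 130
nanny_estimate 80 8 10   # memory in Mi      -> 160
```

After saving the ConfigMap to a file, apply_metrics_server_config metrics-server-config.yaml applies it and restarts the Deployment. Whether the restart succeeds still depends on the pods being able to reach the API server.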

Answer 2

Naninga Karunaratne 0

Hello there!

Sorry to hear that you are having issues with your cluster. Which version did you upgrade to 1.25.5 from? Also, to troubleshoot the issue, you can try the following steps:

Check the cluster logs for any error messages.
Verify that the control plane components are running and healthy.
Confirm that the nodes are healthy and have enough resources.
Ensure that the network connectivity between the nodes and the control plane is working.
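
The steps above can be sketched with kubectl roughly as follows. This is an illustration under assumptions, not an official procedure: the function name check_cluster_health and the pod name dns-check are hypothetical, the busybox image must be pullable from the cluster, and because the AKS control plane is managed, its health is only visible through the API server's readiness endpoint.

```shell
#!/bin/sh
# Sketch only: the function and pod names are hypothetical.
# Assumes kubectl is configured for the affected cluster.

# $1 = DNS name to resolve from inside the cluster
#      (default: the in-cluster API service name)
check_cluster_health() {
  name="${1:-kubernetes.default.svc.cluster.local}"

  # 1. Cluster logs: recent warning events and metrics-server container logs
  kubectl get events -n kube-system --field-selector type=Warning --sort-by=.lastTimestamp
  kubectl logs -n kube-system -l k8s-app=metrics-server --all-containers --tail=50

  # 2. Control plane: API server readiness as seen by the client, plus system pods
  kubectl get --raw='/readyz?verbose'
  kubectl get pods -n kube-system -o wide

  # 3. Nodes: status and resource usage
  kubectl get nodes -o wide
  kubectl top nodes   # relies on metrics-server, so it may fail in this scenario

  # 4. Network: resolve a name from a throwaway pod to exercise cluster DNS
  kubectl run dns-check --rm -i --restart=Never --image=busybox:1.36 -- nslookup "$name"
}
```

Calling check_cluster_health production-dns-23c01f45.hcp.westeurope.azmk8s.io would test resolution of the API server hostname that times out in the logs from the question.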

Hope it helps to achieve your goal

Please "Accept as Answer" and Upvote if it helped, so that it can help others in the community looking for help on similar topics. Thank you!

Arundeep Singh 0 Reputation points

2023-02-18T16:52:44.1066667+00:00
Hi Naninga,

Upgrade was from 1.24.9.

The upgrade errors mention metrics-server-pdb pod drain issues. Following the steps suggested in the error message, I temporarily deleted the blocking PDB and retriggered the upgrade.

At the time of creating the issue, the control plane and nodes were at 1.25.5 but with status "Failed". After retriggering the upgrade with the "--control-plane-only" option, the cluster is now in a success state.

But that did not change the effective outcome.

The main error so far still seems to be the same: metrics-server cannot connect to the API server. In turn, konnectivity-agent keeps failing because it cannot connect to metrics-server.

Nodes are healthy as per Azure status and have enough resources.

As for your last point: the error itself says that there is a network issue. If I knew how to "ensure" these general suggestions, I assume there would be no need to raise a ticket, right?

Naninga Karunaratne 0 Reputation points

2023-02-18T22:58:11.5233333+00:00

Hello Arundeep,

Hope you are keeping well. Apologies if the previous answer didn't cover your requirement.

To troubleshoot the AKS metrics-server not being able to connect to the API server, you can follow the steps in the article "Troubleshoot collection of Prometheus metrics in Azure Monitor (preview)": https://learn.microsoft.com/en-us/azure/azure-monitor/essentials/prometheus-metrics-troubleshoot

Also, this forum is more of a general question-and-answer platform. If you would like us to dive deeper into your issue with more data and information from your side, please submit a ticket through the Azure portal under "Support and Troubleshooting" by creating a "New Support Request", and we will be more than happy to help you out.
