After upgrading to AKS 1.25.5, the cluster is in a failed state. All pods fail because they cannot connect to in-cluster services by name. Outbound connections, e.g. to GitHub, also fail. It seems to be because metrics-server is not able to connect to the API server.
metrics-server-vpa ERROR: logging before flag.Parse: I0216 23:09:50.341577 1 pod_nanny.go:69] Version: 1.8.14
metrics-server-vpa ERROR: logging before flag.Parse: I0216 23:09:50.341601 1 pod_nanny.go:85] Watching namespace: kube-system, pod: metrics-server-6fcdf95df7-49f6k, container: metrics-server.
metrics-server-vpa ERROR: logging before flag.Parse: I0216 23:09:50.341608 1 pod_nanny.go:86] storage: MISSING, extra_storage: 0Gi
metrics-server-vpa ERROR: logging before flag.Parse: I0216 23:09:50.440128 1 pod_nanny.go:189] Failed to read data from config file "/etc/config/NannyConfiguration": open /etc/config/NannyConfiguration: no such file or directory, using default parameters
metrics-server-vpa ERROR: logging before flag.Parse: I0216 23:09:50.440194 1 pod_nanny.go:116] cpu: 44m, extra_cpu: 0.5m, memory: 51Mi, extra_memory: 4Mi
metrics-server-vpa ERROR: logging before flag.Parse: I0216 23:09:50.440248 1 pod_nanny.go:145] Resources: [{Base:{i:{value:44 scale:-3}d:{Dec:<nil>} s:44m Format:DecimalSI} ExtraPerNode:{i:{value:5 scale:-4} d:{Dec:<nil>} s: Format:DecimalSI} Name:cpu} {Base:{i:{value:53477376 scale:0} d:{Dec:<nil>} s:51Mi Format:BinarySI} ExtraPerNode:{i:{value:4194304 scale:0} d:{Dec:<nil>} s:4Mi Format:BinarySI} Name:memory}]
metrics-server-vpa ERROR: logging before flag.Parse: E0216 23:10:20.447278 1 nanny_lib.go:128] Get "https://production-dns-23c01f45.hcp.westeurope.azmk8s.io:443/api/v1/nodes?resourceVersion=0": dial tcp: i/o timeout
metrics-server Error: unable to load configmap based request-header-client-ca-file: Get "https://production-dns-23c01f45.hcp.westeurope.azmk8s.io:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication": dial tcp: i/o timeout
At times the error is more detailed, but it still ends in a timeout. In the log below, the DNS lookup against the cluster DNS service at 10.0.0.10:53 is what times out.
metrics-server-vpa ERROR: logging before flag.Parse: E0216 23:38:00.574679 1 nanny_lib.go:128] Get "https://production-dns-23c01f45.hcp.westeurope.azmk8s.io:443/api/v1/nodes?resourceVersion=0": dial tcp: lookup production-dns-23c01f45.hcp.westeurope.azmk8s.io on 10.0.0.10:53: read udp 10.244.9.6:45415->10.0.0.10:53: i/o timeout
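To tell a DNS failure apart from a TCP connect failure, something like the following could be run from a debug pod in the cluster. This is only a rough sketch and assumes Python 3 is available in that pod; the function name `probe` and its return values are made up for illustration. Note that `socket.getaddrinfo` does not honor the timeout argument, so a slow resolver can make the first step take longer than expected.

```python
import socket


def probe(host: str, port: int, timeout: float = 5.0) -> str:
    """Return "dns" if name resolution fails, "tcp" if no resolved
    address accepts a connection, and "ok" otherwise."""
    try:
        # Step 1: resolve the name using the pod's configured resolver.
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns"

    # Step 2: try a plain TCP connect to each resolved address.
    for family, socktype, proto, _canonname, addr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(addr)
                return "ok"
        except OSError:
            # Covers timeouts and refused connections; try the next address.
            continue
    return "tcp"


# Example call (host name taken from the logs above):
# print(probe("production-dns-23c01f45.hcp.westeurope.azmk8s.io", 443))
```

A result of "dns" would point at the path from the pod to 10.0.0.10:53, while "tcp" would mean the name resolves but the API server endpoint is not reachable on port 443.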
This in turn causes problems for konnectivity-agent. The IP in the error below is the IP of one of the metrics-server pods.
konnectivity-agent-7d7c4fdddc-sdcrg E0217 13:24:04.598863 1 client.go:447] "error dialing backend" err="dial tcp 10.244.9.6:4443: i/o timeout" dialID=8845197344407888474
Currently the complete cluster is unusable. I connect to the cluster from my laptop with helm, kubectl, etc. It seems like the AKS internal network has some issue.