AKS pods not able to communicate to ACR and external MongoDB

Chris Christensen 1

tldr

We noticed yesterday that some of our deployed pods started getting timeout errors when connecting to our external MongoDB hosted at DigitalOcean. The applications running on these pods handle little traffic, and MongoDB responds instantly whenever a pod can connect to it. We dug further and found that pods deployed to one of the allocated nodes work as expected, but pods on the other node have connectivity problems, not only to the MongoDB server but also when pulling images from ACR. The last code changes were made a couple of weeks ago, and the AKS configuration has not changed in months. We need help resolving the connectivity issues so that the pods work as expected.

setup

We have an AKS cluster that has two namespaces, ws-stage and ws-prod.
We have two node pools, agentpool and userpool1.
- agentpool has min nodes of 2 and max nodes of 2
  - CPU for both is 11%
  - Memory is 47% for 1 and 118% for the other (we just scaled this from 1 to 2 about an hour ago)
- userpool1 has min nodes of 2 and max nodes of 3
  - 2 nodes are currently allocated: node-c, node-d
  - CPU is 7% for both
  - Memory is 63% for both
  - pods created on node-c work as expected
  - pods created on node-d get errors pulling images from ACR and cannot connect to the external MongoDB
We have 3 deployments, ws-spa, ws-api, ws-service.
- ws-api has 3 replicas
  - 2 pods were created on node-d and have connectivity issues
  - 1 pod was created on node-c and works fine
- ws-service has 1 replica
  - its 1 pod always gets allocated to node-d and has connectivity issues with ACR and MongoDB
Other than these deployments, the rest of our setup is very vanilla.

details

pod descriptions

Running `kubectl describe pods ws-service-xxx-yyy` shows:
- the pod is allocated to node-d
- Events show it successfully pulled image from ACR and then a few seconds later it failed with "failed to do request: Head "https://ourcompany.azurecr.io/v2/ws-service/manifests/v0.3.2588": net/http: TLS handshake timeout"
- Events also show multiple NodeNotReady events from the node-controller
Running `kubectl describe pods ws-api-5f6545b8c6-kbs7h` shows:
- the pod is allocated to node-d
- Events show it successfully pulled image from the ACR and then a few minutes later it failed with "failed to authorize: failed to fetch anonymous token: unexpected status from GET request to https://ourcompany.azurecr.io/oauth2/token?scope=repository%3Aws-api%3Apull&service=ourcompany.azurecr.io: 401 Unauthorized"
Running `kubectl describe pods ws-api-5f6545b8c6-k7q9l` shows:
- the pod is allocated to node-c
- No events
- this pod performs as expected.
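
To confirm that the failures track the node rather than the deployment, it can help to list everything scheduled on each node and compare. A minimal sketch, assuming `kubectl` is already pointed at the cluster; `pods_on_node` is a hypothetical helper name, not something from the original setup:

```shell
# pods_on_node NODE: list every pod scheduled on NODE, across all namespaces.
# The helper name is made up for illustration; the field selector
# spec.nodeName is a standard kubectl pod field selector.
pods_on_node() {
  kubectl get pods --all-namespaces -o wide --field-selector "spec.nodeName=$1"
}
```

Calling `pods_on_node node-d` and `pods_on_node node-c` (with the real node names substituted) gives two lists whose STATUS and RESTARTS columns can be compared side by side.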

mongosh

I opened a shell in both a healthy pod and an unhealthy pod and installed mongosh to test connectivity to the DigitalOcean MongoDB.

healthy pod
- installation of mongosh was easy and it connected to the MongoDB instantly
unhealthy pod
- installation was slow or non-responsive. Doing anything from apt-get update to apt-get install wget would take longer than it should. It would get stuck for a minute or so pulling headers.
- Running the following command would hang and never respond. I had to cancel it and retry a few times before it ran instantly: `wget -qO- https://www.mongodb.org/static/pgp/server-7.0.asc | tee /etc/apt/trusted.gpg.d/server-7.0.asc`
- When I finally got mongosh installed and tried connecting to our DigitalOcean MongoDB, it always timed out after 30 seconds.
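
Because the hangs described above are intermittent, a single request can be misleading; repeating the request and recording how long the TCP connect and TLS handshake take gives a clearer picture. The following is only a sketch, assuming `curl` is installed in the pod; `probe_endpoint` is a hypothetical helper name. It suits HTTPS endpoints such as the registry; MongoDB speaks its own wire protocol, so it would need a different client.

```shell
# probe_endpoint URL [ATTEMPTS]: request URL several times and report, per
# attempt, curl's time to TCP connect and time to finish the TLS handshake.
# Hypothetical helper; assumes curl is available inside the pod.
probe_endpoint() {
  url=$1
  attempts=${2:-5}
  failures=0
  i=1
  while [ "$i" -le "$attempts" ]; do
    # time_connect and time_appconnect are curl --write-out variables.
    if timing=$(curl -sS -o /dev/null --connect-timeout 5 --max-time 15 \
        -w 'connect=%{time_connect}s tls=%{time_appconnect}s' "$url"); then
      echo "attempt $i: $timing"
    else
      echo "attempt $i: failed"
      failures=$((failures + 1))
    fi
    i=$((i + 1))
  done
  echo "failures: $failures/$attempts"
}
```

For example, `probe_endpoint https://ourcompany.azurecr.io/v2/ 10` run from a pod on each node would show whether handshakes are slow or failing only on one of them. An unauthenticated request to that path is normally answered with 401, which still counts as a successful attempt here because the handshake completed.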

questions

I would think that the health of a node in AKS is something we should not have to worry about in general, especially considering the light load we are putting on it. How do we make sure the nodes AKS gives us are healthy and work as expected?
Does anyone have any advice about how to troubleshoot this further or how to resolve this? We need to get this resolved.

1 answer

Answer 1

Anveshreddy Nimmala 3,550 Microsoft External Staff Moderator

Hello Chris Christensen,

Welcome to Microsoft Q&A, and thank you for posting your query here.

Pods on node-d are hitting both TLS handshake timeouts and 401 Unauthorized errors when pulling images from Azure Container Registry (ACR). A 401 status means the request was not completed because it lacked valid authentication credentials for the requested resource.

Use kubectl get nodes -o wide to verify the status of node-d.

Use kubectl describe node node-d to confirm resource availability on node-d.

SSH into node-d to verify that it can resolve DNS names and reach external endpoints, by accessing the external services manually from the node.

Use kubectl logs node-d to review the logs, and check the system logs on node-d under the /var/log directory for signs of errors that could explain the NodeNotReady state.

Try bringing up a new node in place of node-d and check whether it works.

Hope this helps you.

If an answer has been helpful, please consider accepting the answer to help increase visibility of this question for other members of the Microsoft Q&A community. If not, please let us know what is still needed in the comments so the question can be answered. Thank you for helping to improve Microsoft Q&A!

Chris Christensen 1 Reputation point

2024-05-11T20:23:43.4166667+00:00
Hi Anveshreddy,

Thanks for your reply. Just an update, we resolved this by scaling the nodes to include an additional node and then cordoning and draining node-d.
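
For anyone following along, the cordon-and-drain part of that fix can be sketched roughly as below. This is an illustration rather than the exact commands we ran: `retire_node` is a hypothetical helper name, the 300-second timeout is an arbitrary choice, and the scale-up of the node pool is assumed to have been done beforehand.

```shell
# retire_node NODE: stop new pods landing on NODE, then evict what is there.
# Hypothetical helper; assumes a replacement node is already Ready.
retire_node() {
  node=$1
  # Mark the node unschedulable so rescheduled pods go elsewhere.
  kubectl cordon "$node" || return 1
  # Evict the pods. DaemonSet-managed pods are left in place and
  # emptyDir volume data on the node is discarded.
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=300s
}
```

Usage would be `retire_node aks-userpool1-21921076-vmss00000d`. The node stays in the pool, cordoned, so it can still be inspected afterwards.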

Per your steps above:

Running kubectl get nodes -o wide returns the same values for node-d as for the other nodes, except for internal-ip. version, roles, status, os-image, kernel-version, and container-runtime are identical across all 6 nodes we currently have.

Running kubectl describe node shows CPU at 20% and memory at 10% utilization.

Regarding accessing the logs for the node, I'm not sure that can be done with kubectl logs. It looks like k8s is coming out with a node query tool in 1.30, but our nodes are on 1.27.
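
Since kubectl logs works on pods rather than nodes, the closest equivalents available through the API are the node's conditions and the events recorded against the node object. A rough sketch of pulling both; `node_conditions` and `node_events` are hypothetical helper names, and this does not replace the system logs on the node itself:

```shell
# node_conditions NODE: print each condition (Ready, MemoryPressure, ...)
# with its status and reason, one per line.
node_conditions() {
  kubectl get node "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status} ({.reason}){"\n"}{end}'
}

# node_events NODE: list events whose involved object is the node itself.
node_events() {
  kubectl get events --all-namespaces \
    --field-selector "involvedObject.kind=Node,involvedObject.name=$1"
}
```

Running both against node-d and against a healthy node makes it easier to spot a condition or event that only the failing node reports.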

We are keeping node-d around until we understand why it would not work like its peer nodes.

Thanks!

Chris Christensen 1 Reputation point

2024-05-11T20:32:40.07+00:00
I found a way to access the node using kubectl debug:

node-d = aks-userpool1-21921076-vmss00000d

I ran the following:

kubectl debug node/aks-userpool1-21921076-vmss00000d -it --image=mcr.microsoft.com/cbl-mariner/busybox:2.0

It never returns a prompt because of the following errors:
Creating debugging pod node-debugger-aks-userpool1-21921076-vmss00000d-pg65h with container debugger on node aks-userpool1-21921076-vmss00000d.

Warning: container debugger: rpc error: code = Unknown desc = failed to pull and unpack image "mcr.microsoft.com/cbl-mariner/busybox:2.0": failed to resolve reference "mcr.microsoft.com/cbl-mariner/busybox:2.0": failed to do request: Head "https://mcr.microsoft.com/v2/cbl-mariner/busybox/manifests/2.0": net/http: TLS handshake timeout

Warning: container debugger: Back-off pulling image "mcr.microsoft.com/cbl-mariner/busybox:2.0"

It just keeps repeating these two errors.

Anveshreddy Nimmala 3,550 Reputation points Microsoft External Staff Moderator

2024-05-13T05:49:10.5666667+00:00

Hello Chris Christensen,

Verify that DNS resolution is working correctly. If it fails, check and configure your cluster's DNS settings.

nslookup mcr.microsoft.com

As a workaround, use the official BusyBox image from Docker Hub.

kubectl debug node/aks-userpool1-21921076-vmss00000d -it --image=busybox:latest

You can diagnose further by checking the detailed logs of the Kubernetes components and the networking setup.

kubectl get events --sort-by='.lastTimestamp'

kubectl logs node-debugger-aks-userpool1-21921076-vmss00000d-pg65h
