AKS pods not able to communicate to ACR and external MongoDB

Chris Christensen 1 Reputation point
2024-05-08T15:12:51.2633333+00:00

tldr

We noticed yesterday that some of our deployed pods started getting timeout errors while trying to connect to our external MongoDB hosted at DigitalOcean. The applications running on these pods handle little traffic, and MongoDB responds instantly when a pod can connect to it. We dug further and found that pods deployed to one of the allocated nodes work as expected, but pods allocated to the other node have connectivity problems, not only to the MongoDB server but also when pulling images from the ACR. The last code changes were made a couple of weeks ago, and the AKS configuration has not changed in months. We need assistance resolving the connectivity issues so that the pods work as expected.

setup

  • We have an AKS cluster that has two namespaces, ws-stage and ws-prod.
  • We have two node pools, agentpool and userpool1.
    • agentpool has min nodes of 2 and max nodes of 2
      • CPU for both is 11%
      • Memory is 47% for 1 and 118% for the other (we just scaled this from 1 to 2 about an hour ago)
    • userpool1 has min nodes of 2 and max nodes of 3
      • 2 nodes are currently allocated: node-c, node-d
      • CPU is 7% for both
      • Memory is 63% for both
      • pods created in node-c work as expected
      • pods created in node-d get errors pulling image from ACR and cannot connect to external MongoDB
  • We have 3 deployments, ws-spa, ws-api, ws-service.
    • ws-api has 3 replicas
      • 2 pods created on node-d and have connectivity issues
      • 1 pod created on node-c works fine
    • ws-service has 1 replica.
      • 1 pod created always gets allocated to node-d and has connectivity issues with ACR and MongoDB
  • Other than these deployments, the rest of our setup is very vanilla
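
To double-check which pods are scheduled on which node, a minimal sketch along these lines could be used. It assumes kubectl is already authenticated against the cluster; `pods_on_node` is a hypothetical helper name, and node-c and node-d are the shortened node names used in this post, not the full AKS node names.

```shell
# Hypothetical helper: list every pod scheduled on one node, across all namespaces.
# Assumes kubectl is already authenticated against the AKS cluster.
pods_on_node() {
  kubectl get pods --all-namespaces -o wide --field-selector "spec.nodeName=$1"
}

# Example (node names are the placeholders used in this post):
#   pods_on_node node-d
#   pods_on_node node-c
```

Comparing the two listings shows whether every failing pod sits on node-d, including system pods in kube-system.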

details

pod descriptions

  • running kubectl describe pods ws-service-xxx-yyy shows
    • the pod is allocated to node-d
    • Events show it successfully pulled image from ACR and then a few seconds later it failed with "failed to do request: Head "https://ourcompany.azurecr.io/v2/ws-service/manifests/v0.3.2588": net/http: TLS handshake timeout"
    • Events also show multiple NodeNotReady events from the node-controller
  • running kubectl describe pods ws-api-5f6545b8c6-kbs7h shows
  • running kubectl describe pods ws-api-5f6545b8c6-k7q9l shows
    • the pod is allocated to node-c
    • No events
    • this pod performs as expected.
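
The NodeNotReady events mentioned above can also be read from the node's side rather than from each pod. This is only a sketch: `node_events` is a hypothetical helper name, node-d is the placeholder node name from this post, and kubectl is assumed to be authenticated against the cluster.

```shell
# Hypothetical helper: show events recorded against a node (e.g. NodeNotReady),
# sorted by last timestamp. Assumes kubectl is authenticated against the cluster.
node_events() {
  kubectl get events --all-namespaces \
    --field-selector "involvedObject.kind=Node,involvedObject.name=$1" \
    --sort-by=.lastTimestamp
}

# Example:
#   node_events node-d
```

Kubernetes only retains events for a limited time, so this is most useful shortly after a failure.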

mongosh

I opened a shell in both a healthy pod and an unhealthy pod and installed mongosh to test connectivity to the DigitalOcean MongoDB.

  • healthy pod
    • installation of mongosh was easy and it connected to the MongoDB instantly
  • unhealthy pod
    • installation was slow or non-responsive. Anything from apt-get update to apt-get install wget took longer than it should, getting stuck for a minute or so pulling headers.
    • the following command would just hang and never respond. I had to cancel it and try again a few times before it ran instantly: `wget -qO- https://www.mongodb.org/static/pgp/server-7.0.asc | tee /etc/apt/trusted.gpg.d/server-7.0.asc`
    • When I finally got mongosh installed and tried connecting to our DigitalOcean MongoDB, it always timed out after 30 seconds.
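
A quicker reachability check that avoids installing mongosh is sketched below. It assumes the container image has bash and the coreutils `timeout` command; `check_tcp` is a hypothetical helper name, the MongoDB hostname is a placeholder, and 27017 is MongoDB's default port, which may not match the actual deployment.

```shell
# Hypothetical helper: try to open a TCP connection to host:port within a timeout.
# Uses bash's /dev/tcp pseudo-device, so it needs bash and coreutils `timeout`
# in the container image (an assumption about the image).
check_tcp() {
  local host="$1" port="$2" secs="${3:-5}"
  if timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null; then
    echo "reachable: $host:$port"
  else
    echo "unreachable: $host:$port"
    return 1
  fi
}

# Example (hostname is a placeholder; 27017 is MongoDB's default port):
#   check_tcp your-mongodb-host.example.com 27017
#   check_tcp ourcompany.azurecr.io 443
```

This only tests the TCP connect. The ACR error above is a TLS handshake timeout, which happens after the TCP connection is open, so a "reachable" result here does not rule out stalls at the TLS stage.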

questions

  1. I would think that the health of a node in AKS is something we should not have to worry about in general, especially considering the light load we are putting on it. How do we deal with this to ensure the nodes we get from AKS are healthy and work as expected?
  2. Does anyone have any advice about how to troubleshoot this further or how to resolve this? We need to get this resolved.
Azure Kubernetes Service

1 answer

  1. Anveshreddy Nimmala 3,550 Reputation points Microsoft External Staff Moderator
    2024-05-09T04:38:29.5433333+00:00

    Hello Chris Christensen,

    Welcome to Microsoft Q&A, and thank you for posting your query here.

    Pods on node-d are hitting TLS handshake timeouts when pulling images from the Azure Container Registry (ACR) and cannot reach the external MongoDB. If you also see 401 Unauthorized errors on the pulls, that status code means the request lacked valid authentication credentials for the requested resource, which is a separate problem from the timeouts.

    Use kubectl get nodes -o wide to verify the status of node-d.

    Use kubectl describe node node-d to confirm resource availability and check the node conditions on node-d.

    SSH into node-d to verify that it can resolve DNS names and reach external endpoints, by accessing the external services manually from the node.

    Review the kubelet logs on node-d (for example with journalctl -u kubelet, since kubectl logs only works on pods, not nodes) and the system logs under the /var/log directory for signs of errors that could explain the NodeNotReady state.

    Try bringing up a new node in place of node-d and check whether the pods work there.
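
One way to take node-d out of rotation before replacing it is sketched below, as an illustration rather than a prescribed procedure. `isolate_node` is a hypothetical helper name, node-d is the placeholder node name from the question, and kubectl is assumed to be authenticated against the cluster.

```shell
# Hypothetical helper: stop scheduling onto a node, then evict its pods so the
# deployments recreate them on other nodes. Assumes kubectl is authenticated.
isolate_node() {
  kubectl cordon "$1" &&
    kubectl drain "$1" --ignore-daemonsets --delete-emptydir-data
}

# Example:
#   isolate_node node-d
```

The evicted pods need room on the remaining nodes, and `--delete-emptydir-data` discards any emptyDir data on the drained pods, so check both before running it against production workloads.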

    Hope this helps you.

    If an answer has been helpful, please consider accepting the answer to help increase visibility of this question for other members of the Microsoft Q&A community. If not, please let us know what is still needed in the comments so the question can be answered. Thank you for helping to improve Microsoft Q&A!

