Network troubleshooting on AKS

Subin Sabu 0 Reputation points
2025-03-13T13:00:14.8933333+00:00

Hi, I have two similar clusters, one for development and the other for testing. Node size, web app versions, node pools, almost everything is the same, except that one is in South India and the other is in UK South. Of the two, the AKS cluster in UK South is lagging a lot. I accessed it from a VM in the same region to check whether it's a region-specific issue. The delay is around 7x that of the other cluster. I port-forwarded both my web app and API, through the service and through direct pod access, and the result was the same.

By lagging, I mean there is a delay in loading the login page. For example, there is a JS content download: one cluster receives it in 1 or 2 seconds, while the other takes almost 7 seconds. I believe it has nothing to do with ingress, since I checked directly against the service and the pods. It is also not specific to one service, since all requests on the login page and the other APIs are delayed correspondingly. I think this has something to do with the networking of my cluster. By the way, the health of my load balancer dropped from 100 to 75 for a brief time; I'm not sure whether that is related. Can anyone suggest a way to troubleshoot the issue?

Azure Kubernetes Service

2 answers

  1. Alessandro Vozza 1 Reputation point Microsoft Employee
    2025-03-14T15:32:12.8333333+00:00

    You could try to assess the inter-node performance with iperf. Deploy a server and a client (make sure they land on different nodes) with something like:

    apiVersion: v1
    kind: Pod
    metadata:
      name: iperf-server
      labels:
        app: iperf-server
    spec:
      containers:
        - name: iperf
          image: cagedata/iperf3
          command: ["iperf3", "-s"]
          ports:
            - containerPort: 5201
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: iperf-server
    spec:
      selector:
        app: iperf-server
      ports:
        - protocol: TCP
          port: 5201
          targetPort: 5201
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: iperf-client
    spec:
      containers:
        - name: iperf
          image: cagedata/iperf3
          command: ["sleep", "30000"]
    
    

    Then check the performance by entering the client pod and running:

    kubectl exec -it iperf-client -- iperf3 -c iperf-server -p 5201
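
Since the test is only meaningful when the two pods run on different nodes, it may help to confirm placement before reading the numbers. The sketch below assumes the pod names from the manifest above and the default namespace; the helper function names are hypothetical, and `-R` is the iperf3 reverse-mode flag, used here to measure the opposite direction.

```shell
# Show which node each iperf pod was scheduled on.
# If both pods report the same NODE, the result reflects traffic
# inside a single node, not node-to-node throughput.
iperf_nodes() {
  kubectl get pod iperf-server iperf-client \
    -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName
}

# Repeat the test in reverse mode (server sends, client receives).
iperf_reverse() {
  kubectl exec iperf-client -- iperf3 -c iperf-server -p 5201 -R
}
```

Run `iperf_nodes` first; if both pods share a node, delete one and recreate it so the scheduler places it elsewhere, then run the test on both clusters and compare.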
    

  2. LISBOA-4826 245 Reputation points Volunteer Moderator
    2025-03-15T15:14:40.4666667+00:00

    Hello Subin Sabu

    Welcome to the Microsoft Q&A Platform! Thank you for asking your question here.

    I understand that you are experiencing a network issue on your AKS cluster.

    Troubleshooting on AKS can sometimes be complicated, but I want to share some tools that you can try.

    • My first question: do you have Azure monitoring enabled for both AKS clusters?
    • If yes, what insights do you get from it?

    A)

    You can collect a tcpdump capture and read it with Wireshark.

    https://github.com/ioanc/k8s-network-troubleshooting/blob/master/tcpdump-node-local.sh

    Official Microsoft documentation on how to collect it - https://learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/logs/capture-tcp-dump-linux-node-aks

    B)

    Install troubleshooting tools to help. For example, pull this image:

    docker.io/fransouza/troubleshooting-network-tools:v1 
    

    Contents: network tools (nslookup, ping, nc, dig, traceroute, ifconfig)
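
One possible way to use the image is to start it as a temporary pod inside the cluster and run the tools from there. This is only a sketch: the pod name `nettools` and the function name are hypothetical, and it assumes the image provides `sh`.

```shell
# Start a throwaway pod from the tools image and open a shell in it.
# --rm deletes the pod when the shell exits.
# Assumption: the image contains sh; "nettools" is a hypothetical pod name.
nettools_shell() {
  kubectl run nettools --rm -it --restart=Never \
    --image=docker.io/fransouza/troubleshooting-network-tools:v1 \
    --command -- sh
}
```

From that shell, tools such as nslookup or traceroute can be pointed at the service names of the web app and API on each cluster.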

    C)

    Basic network troubleshooting in AKS

    https://learn.microsoft.com/en-us/azure/architecture/operator-guides/aks/troubleshoot-network-aks#pod-fails-to-allocate-the-ip-address

    D)

    Comparing the two AKS clusters is also a good idea.

    - Check the ports and any other details of the CoreDNS pods.

    - How many instances does the system-mode node pool have? Are there any alerts?

    - What do kubectl get events and kubectl describe on the affected pods show during the latency or timeout events?

    - Run kubectl top nodes to monitor utilization and spot possible capacity issues.

    - What kind of OS disks do the nodes use, managed or ephemeral?
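
The checks in this list could be gathered the same way from both clusters so the outputs can be compared side by side. A minimal sketch, assuming each cluster has its own kubeconfig context (the names `dev` and `test` in the usage note are hypothetical) and that CoreDNS pods carry the `k8s-app=kube-dns` label, as they do on AKS:

```shell
# Collect the same snapshot from one cluster.
# $1 is the kubeconfig context name of that cluster.
cluster_snapshot() {
  ctx="$1"
  # CoreDNS pods: status, restarts, and node placement
  kubectl --context "$ctx" -n kube-system get pods -l k8s-app=kube-dns -o wide
  # Events from all namespaces, sorted by last timestamp
  kubectl --context "$ctx" get events -A --sort-by=.lastTimestamp
  # Node CPU and memory usage (requires metrics-server)
  kubectl --context "$ctx" top nodes
}
```

For example, `cluster_snapshot dev > dev.txt` and `cluster_snapshot test > test.txt`, then compare the two files.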

    Details here:

    E)

    Last but not least, if you are on the Free tier and want a guaranteed service level, you can upgrade the pricing tier of your AKS cluster.

    The Free tier does not include a financially backed SLA.

    https://learn.microsoft.com/en-us/azure/aks/free-standard-pricing-tiers

    If it was helpful, please ACCEPT the Answer and click "UpVOTE" on this post to let us know.

    Thank You.

    Lisboa

