Network troubleshooting on AKS

Question

Subin Sabu 0

Hi, I have two similar clusters, one for development and the other for testing. Node size, web app versions, and node pools are almost identical; the only difference is that one is in South India and the other is in UK South. Of the two, the AKS cluster in UK South is lagging badly. I accessed it from a VM in the same region to check whether the issue is region-specific. Responses take around 7 times longer than on the other cluster. I port-forwarded both my web app and API, through the service and through direct pod access, and the result was the same.

By lagging, I mean there is a delay in loading the login page. For example, there is a JS content download that one cluster receives in 1 or 2 seconds, while the other cluster takes almost 7 seconds. I believe it has nothing to do with ingress, since I checked directly against the service and the pods. It is also not specific to one service, since all requests on the login page and the other APIs are delayed correspondingly. I think this has something to do with the networking of my cluster. By the way, the health of my load balancer dropped from 100 to 75 for a brief time; I'm not sure whether that is related. Can anyone please suggest a way to troubleshoot the issue?

Pramidha Yathipathi 1,135 Reputation points Microsoft External Staff Moderator

2025-03-17T09:18:56.4333333+00:00

Hi Subin Sabu,

Just checking in to see if you had a chance to review the answer on your question. Feel free to reach out if you have any further queries.

If you found the information useful, please click "Upvote" on the post to let us know.

Thank You.

Pramidha Yathipathi 1,135 Reputation points Microsoft External Staff Moderator

2025-03-18T06:42:05.62+00:00

Hi Subin Sabu,

Just checking in to see if you had a chance to review the answer on your question. Feel free to reach out if you have any further queries.

If you found the information useful, please click "Upvote" on the post to let us know.

Thank You.

Answer 1

You could try to assess the inter-node performance with iperf. Deploy a server and a client (make sure they land on different nodes) with something like:

apiVersion: v1
kind: Pod
metadata:
  name: iperf-server
  labels:
    app: iperf-server
spec:
  containers:
    - name: iperf
      image: cagedata/iperf3
      command: ["iperf3", "-s"]
      ports:
        - containerPort: 5201
---
apiVersion: v1
kind: Service
metadata:
  name: iperf-server
spec:
  selector:
    app: iperf-server
  ports:
    - protocol: TCP
      port: 5201
      targetPort: 5201
---
apiVersion: v1
kind: Pod
metadata:
  name: iperf-client
spec:
  containers:
    - name: iperf
      image: cagedata/iperf3
      command: ["sleep", "30000"]

Then check the performance by running iperf3 from the client pod:

kubectl exec -it iperf-client -- iperf3 -c iperf-server -p 5201
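
To compare the two clusters on the same footing, one option is to have iperf3 emit JSON (its -J / --json flag) and extract the receiver-side throughput. The following is a minimal Python sketch and not part of the original answer: the function names are hypothetical, and it assumes the JSON layout of a TCP test, where the totals appear under end.sum_received.bits_per_second.

```python
import json


def received_mbps(iperf_json: str) -> float:
    """Return receiver-side throughput in Mbit/s from `iperf3 -J` output (TCP test)."""
    report = json.loads(iperf_json)
    # iperf3 puts a top-level "error" key in the JSON when the run fails.
    if "error" in report:
        raise RuntimeError(f"iperf3 reported an error: {report['error']}")
    bits_per_second = report["end"]["sum_received"]["bits_per_second"]
    return bits_per_second / 1_000_000


def slowdown(fast_json: str, slow_json: str) -> float:
    """Return how many times lower the second run's throughput is than the first."""
    return received_mbps(fast_json) / received_mbps(slow_json)
```

To use it, save one report per cluster, for example with kubectl exec iperf-client -- iperf3 -c iperf-server -p 5201 -J > uksouth.json, then pass the contents of each file to received_mbps. A large gap between the two clusters would point at the node-to-node network; similar numbers would suggest looking elsewhere, such as DNS or node capacity.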

Answer 2

Hello Subin Sabu

Welcome to the Microsoft Q&A Platform! Thank you for asking your question here.

I understand that you are experiencing network issues on your AKS cluster.

Troubleshooting on AKS can sometimes be complicated, but I want to share some tools that you can try.

My first question: do you have Azure monitoring enabled for both AKS clusters?
If yes, what insights do you get from it?

A)

You can try collecting a tcpdump capture, which you can then open in Wireshark.

https://github.com/ioanc/k8s-network-troubleshooting/blob/master/tcpdump-node-local.sh

Official Microsoft documentation on how to collect it: https://learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/logs/capture-tcp-dump-linux-node-aks

B)

Install troubleshooting tools to help. For example, pull this image:

docker.io/fransouza/troubleshooting-network-tools:v1

Contents: network tools (nslookup/ping/nc/dig/traceroute/ifconfig)

C)

Basic network troubleshooting in AKS

https://learn.microsoft.com/en-us/azure/architecture/operator-guides/aks/troubleshoot-network-aks#pod-fails-to-allocate-the-ip-address

D)

Comparing the two AKS clusters is also a good idea.

-Check the ports and any other details for the CoreDNS pods.

-How many instances do you have in the system-mode node pool? Are there any alerts?

-What do kubectl get events, or kubectl describe on those pods, show during the latency or timeout events?

-Run kubectl top nodes to monitor utilization and spot possible capacity issues.

-What kind of disks are you using for the nodes, managed or ephemeral?
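
The kubectl top nodes check in the list above could be scripted so that both clusters are screened in the same way. This is a rough Python sketch under stated assumptions: the function name and the 80% threshold are hypothetical, and it assumes the default column layout (NAME, CPU(cores), CPU%, MEMORY(bytes), MEMORY%).

```python
from typing import List


def busy_nodes(top_output: str, threshold_pct: int = 80) -> List[str]:
    """Return names of nodes whose CPU% or MEMORY% is at or above threshold_pct."""
    flagged = []
    for line in top_output.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue
        name, cpu_pct, mem_pct = fields[0], fields[2], fields[4]
        # Nodes without metrics show "<unknown>"; keep only values shaped like "NN%".
        usage = [
            int(value.rstrip("%"))
            for value in (cpu_pct, mem_pct)
            if value.rstrip("%").isdigit()
        ]
        if any(pct >= threshold_pct for pct in usage):
            flagged.append(name)
    return flagged
```

Feed it the text captured from kubectl top nodes on each cluster. If nodes are flagged only in UK South, capacity rather than the network may be the first thing to rule out.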

Details here:

E)

Last but not least, if you are using the Free tier, consider upgrading the pricing tier of your AKS cluster.

The Free tier does not come with a financially backed SLA. Note that the SLA on the paid tiers covers Kubernetes API server uptime rather than pod network performance, so an upgrade may not remove the latency by itself.

https://learn.microsoft.com/en-us/azure/aks/free-standard-pricing-tiers

If this was helpful, please accept the answer and click "Upvote" on this post to let us know.

Thank You.

Lisboa
