
Diagnosing Intermittent Connectivity Issues to Specific External Endpoint from AKS Cluster

Piotr Grochowski 0 Reputation points
2025-12-16T11:32:40.86+00:00

We have an AKS cluster experiencing intermittent connection timeouts (30-35 seconds) when connecting to a specific external HTTPS endpoint, while all other external endpoints work perfectly. We have a second AKS cluster with identical configuration that connects to the same endpoint without issues.

Environment:

  • Azure Kubernetes Service (AKS)
  • Network: Azure CNI with Overlay mode
  • Outbound: NAT Gateway
  • Region: UK South
  • 8 nodes across multiple node pools

Symptoms:

  • ~50% of requests time out after 30-35 seconds
  • DNS resolution is fast (<20ms)
  • TCP connection establishes quickly (<20ms)
  • SSL/TLS handshake completes successfully
  • Data transfer phase either times out or takes 3-30+ seconds
  • Other external endpoints (Google, GitHub, etc.) work consistently with 50-70ms response times

What We've Tried:

  • Increased NAT Gateway idle timeout to 30 minutes
  • Verified no SNAT port exhaustion
  • Tested from multiple pods across different nodes
  • Confirmed the endpoint works from our other AKS cluster
  • Ruled out application-level issues (same code works on other cluster)

Question: Since traditional network diagnostic tools like traceroute and mtr don't work well from within containers/Kubernetes, what Azure-native tools or methods can we use to:

  1. Trace the network path from AKS pods to external endpoints?
  2. Identify where packet loss or delays occur?
  3. Determine if there's an Azure networking layer issue between our cluster and this specific destination?

Are there Azure Network Watcher tools, diagnostic logs, or specific kubectl/Azure CLI commands that would help diagnose this type of intermittent connectivity issue to a specific external endpoint?

Azure Kubernetes Service


3 answers

  1. Manish Deshpande 5,745 Reputation points Microsoft External Staff Moderator
    2025-12-18T07:37:41.2133333+00:00

    Hello @Piotr Grochowski

    The AKS cluster is experiencing intermittent timeouts only when connecting to a specific external HTTPS endpoint, while all other endpoints (Google, GitHub, etc.) respond normally.

    Behaviour like this usually traces back to one of the following in the Azure networking path:

    • NAT Gateway flow issues
    • Overlay + CNI datapath inefficiencies
    • Asymmetric routing
    • MTU / packet fragmentation issues
    • SNAT port behavior with intermittent flows

    Your second, identically configured AKS cluster connects without problems, which strongly suggests that the external service itself is not at fault and that the problem lies in this one cluster's network path.

    Steps to Diagnose & Resolve

    Step 1: Validate Basic Pod-Level Connectivity

    Run tests from multiple pods across node pools:

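    # kubectl run needs a pod name; dns-test and curl-test are arbitrary names, and --rm removes the pod afterwards
    kubectl run dns-test -it --rm --image=busybox --restart=Never -- nslookup <external-endpoint>
    kubectl run curl-test -it --rm --image=curlimages/curl --restart=Never -- curl -v -I --connect-timeout 10 --max-time 45 https://<external-endpoint>
    kubectl exec -it <pod-name> -- watch -n 1 'netstat -an | grep <endpoint-ip>'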
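Since the reported symptoms point at the data-transfer phase rather than DNS, TCP, or TLS, it may help to have curl report each phase of a request separately during Step 1. The following is only a sketch under stated assumptions: `curl_phases` is a made-up helper name, bash and curl are assumed to be available wherever it runs, and it relies on nothing beyond curl's standard `--write-out` timing variables.

```shell
# Assumes bash and curl. curl_phases is a hypothetical helper, not an Azure or AKS tool.
# Usage: curl_phases URL
# curl reports cumulative timestamps; awk converts them into per-phase durations (seconds).
curl_phases() {
  curl -s -o /dev/null --connect-timeout 10 --max-time 45 \
    -w '%{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer} %{time_total} %{http_code}\n' \
    "$1" |
  awk '{
    printf "dns=%.3f tcp=%.3f tls=%.3f first_byte=%.3f transfer=%.3f http=%s\n",
           $1, $2-$1, $3-$2, $4-$3, $5-$4, $6
  }'
}
```

Running this in a loop from pods on different nodes should show whether the slow requests spend their time waiting for the first byte or in the transfer itself, which narrows down what to look for in a packet capture.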
    
    
    

    Step 2: Compare Traffic Flow From a VM in the Same VNet

    Spin up a small test VM in the same subnet/VNet and run:

    • curl -v https://endpoint
    • mtr or traceroute

    Step 3: Use Azure Network Watcher Tools

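    • IP flow verify: checks whether NSG rules allow or block a given flow.
    • Connection troubleshoot: shows the hops to the destination, packet behavior, and where delays occur.
    • Packet capture (pcap) on a VM or virtual machine scale set (AKS nodes are typically scale set instances): for deep inspection when the other tools are inconclusive. This is the most commonly used option.
    • NSG flow logs (with Traffic Analytics): to track whether flows are being reset or dropped.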

    Step 4: Check NAT Gateway Behavior

    1. Confirm the idle timeout is set to 30 minutes (already done)
    2. Validate that SNAT ports are not being reused too quickly
    3. Validate that there is no asymmetric routing

    Step 5: MTU / Fragmentation Testing

    Commands to test:

    ping <endpoint-ip> -M do -s 1350
    

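    If the ping succeeds at smaller payloads but fails as the size increases, an MTU mismatch on the path is likely. Note that many endpoints drop ICMP, so a complete lack of replies is inconclusive.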
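Rather than trying payload sizes by hand, the largest payload that passes with the Don't Fragment bit set can be found with a binary search. This is a rough sketch, assuming bash and the Linux iputils `ping` (busybox `ping` does not support `-M do`); the helper names `df_ping` and `max_df_payload` and the default search range are illustrative choices.

```shell
# Assumes bash and Linux iputils ping. Both function names are hypothetical helpers.
# df_ping SIZE HOST: succeeds if one ping with Don't Fragment set and SIZE bytes
# of ICMP payload gets a reply within 2 seconds.
df_ping() {
  ping -c 1 -W 2 -M do -s "$1" "$2" >/dev/null 2>&1
}

# max_df_payload HOST [LOW] [HIGH]
# Binary-searches LOW..HIGH for the largest payload that passes unfragmented.
# Prints that payload size, or 0 if nothing in the range passes
# (which can also mean ICMP is blocked).
max_df_payload() {
  local host=$1 lo=${2:-1200} hi=${3:-1472} best=0 mid
  while [ "$lo" -le "$hi" ]; do
    mid=$(( (lo + hi) / 2 ))
    if df_ping "$mid" "$host"; then
      best=$mid
      lo=$(( mid + 1 ))
    else
      hi=$(( mid - 1 ))
    fi
  done
  echo "$best"
}
```

The path MTU is the reported payload plus 28 bytes (20 for the IPv4 header, 8 for the ICMP header), so a result of 1472 corresponds to a standard 1500-byte MTU. Comparing the result between the failing and the working cluster is more informative than the absolute number.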

    Step 6: Compare with any Known Good Cluster

    Since a second AKS cluster works fine, compare:

    • Node pool OS versions
    • Underlying VNet/subnet routing tables
    • NAT Gateway associations
    • UDRs
    • Network policies

    This comparison has solved multiple internal cases.
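Part of the Step 6 comparison can be scripted. The sketch below diffs only the `networkProfile` section that `az aks show` returns for each cluster, so it does not cover node OS versions, subnet route tables, or network policies; the function names and the resource group and cluster names in the usage line are placeholders.

```shell
# Assumes bash and a logged-in Azure CLI. Function names are hypothetical helpers.
# network_profile RG CLUSTER: prints the cluster's network profile as JSON.
network_profile() {
  az aks show --resource-group "$1" --name "$2" --query networkProfile --output json
}

# diff_network_profiles RG_A CLUSTER_A RG_B CLUSTER_B
# Prints a unified diff of the two profiles, or "identical" if they match.
diff_network_profiles() {
  local a b
  a=$(mktemp) && b=$(mktemp) || return 1
  network_profile "$1" "$2" >"$a"
  network_profile "$3" "$4" >"$b"
  if diff -u "$a" "$b"; then
    echo "identical"
  fi
  rm -f "$a" "$b"
}

# Example (placeholder names):
#   diff_network_profiles my-rg failing-cluster my-rg working-cluster
```

If the profiles are identical, the difference is more likely in the surrounding VNet resources (route tables, NAT Gateway association, outbound public IPs), which need to be compared separately.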

    Reference links

    https://learn.microsoft.com/en-us/azure/aks/aks-diagnostics#example-scenario-diagnose-connectivity-issues
    https://github.com/MicrosoftDocs/SupportArticles-docs/blob/main/support/azure/azure-kubernetes/connectivity/basic-troubleshooting-outbound-connections.md

    https://learn.microsoft.com/en-us/azure/architecture/operator-guides/aks/troubleshoot-network-aks

    https://docs.azure.cn/en-us/aks/aks-diagnostics

    Thanks,
    Manish Deshpande


  2. Piotr Grochowski 0 Reputation points
    2025-12-16T13:48:47.51+00:00

    As I shared earlier, we're not experiencing any DNS issues, and the AKS Diagnostics tool shows no problems within the cluster.


  3. Q&A Assist
    2025-12-16T11:33:00.3266667+00:00
    AI generated content. This question contains an answer created with AI from Q&A Assist.

    To diagnose intermittent connectivity issues from your AKS cluster to a specific external endpoint, you can utilize several Azure-native tools and methods:

    1. Azure Network Watcher: This service provides various tools to monitor and diagnose network issues in Azure. You can use the following features:
      • Connection Troubleshoot: This tool checks connectivity from an Azure resource such as a VM or scale set instance to a destination. For AKS, the source is a node rather than a pod, and the destination is the external endpoint; the result shows whether the connection succeeds.
      • Connection Monitor (the successor to Network Performance Monitor): This tool can help you monitor connectivity over time and identify latency or packet loss issues.
    2. Diagnosing Packet Loss or Delays: You can use Network Watcher to run a Packet Capture on the VMs or scale set instances that back your AKS nodes. This will allow you to capture packets and analyze them for any signs of loss or delays. You can also check the Network Security Group (NSG) flow logs to see if any traffic is being blocked or dropped.
    3. Azure CLI Commands: You can use the Azure CLI to run commands that provide insights into your network setup. For example:
      • Use az network watcher flow-log list --location <region> to see which flow logs are configured in the region, then inspect the logged flows for issues in the network path.
      • Use az aks show to verify the configuration of your AKS cluster and ensure that the outbound type and NAT Gateway settings are correct.
    4. Kubernetes Commands: You can also run diagnostics from within your AKS pods:
      • Use kubectl exec to run commands like curl or nslookup from within your pods to test connectivity and DNS resolution directly.
      • Consider deploying a debugging pod with tools like tcpdump or iperf to gather more detailed network metrics.

    These tools and methods should help you trace the network path, identify packet loss or delays, and determine if there are any Azure networking layer issues affecting your connectivity to the specific external endpoint.
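When running repeated tests from pods on different nodes, it can be useful to tally the results per node to see whether the failures cluster on particular nodes. The sketch below assumes you have already collected one line per request in the form `NODE EXIT_CODE`, where the second field is curl's exit code (28 is curl's timeout code); the function name and input format are illustrative assumptions.

```shell
# Assumes bash, awk and sort. failures_by_node is a hypothetical helper.
# Reads "NODE EXIT_CODE" lines on stdin (one per request) and prints, per node,
# the number of requests and how many had a non-zero curl exit code.
failures_by_node() {
  awk '{ total[$1]++; if ($2 != 0) failed[$1]++ }
       END { for (n in total) printf "%s total=%d failed=%d\n", n, total[n], failed[n] }' |
  sort
}
```

An even spread of failures across all nodes points toward something shared, such as the outbound path, while failures concentrated on a few nodes point toward those nodes or their subnet.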


