Hello @Piotr Grochowski
The AKS cluster is experiencing intermittent timeouts only when connecting to a specific external HTTPS endpoint, while all other endpoints (Google, GitHub, etc.) respond normally.
This pattern usually points to an issue in the underlying Azure networking path, such as:
- NAT Gateway flow issues
- Overlay + CNI datapath inefficiencies
- Asymmetric routing
- MTU / packet fragmentation issues
- SNAT port behavior with intermittent flows
Your second, identical AKS cluster connects without issue, which strongly suggests the external service itself is not the problem; the fault is specific to the networking path of this one cluster.
Steps to Diagnose & Resolve
Step 1: Validate Basic Pod-Level Connectivity
Run tests from multiple pods across node pools:
kubectl run dns-test -it --rm --image=busybox --restart=Never -- nslookup <external-endpoint>
kubectl run curl-test -it --rm --image=curlimages/curl --restart=Never -- curl -v -I --connect-timeout 10 --max-time 45 https://<external-endpoint>
kubectl exec -it <pod-name> -- watch -n 1 'netstat -an | grep <endpoint-ip>'
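Because the timeouts are intermittent, a single curl may well succeed. Below is a minimal sketch of a repeat probe, assuming a POSIX shell with curl available inside the pod (for example the curlimages/curl image). The function name probe_endpoint and the default of 20 attempts are illustrative only and not part of any Azure or Kubernetes tooling.

```shell
# probe_endpoint URL [COUNT]
# Hypothetical helper: sends COUNT sequential requests and reports how many fail.
# The body runs in a subshell "( ... )" so its variables do not leak into the caller.
probe_endpoint() (
  url=$1
  count=${2:-20}
  ok=0
  failed=0
  i=1
  while [ "$i" -le "$count" ]; do
    # -s: no progress meter, -o /dev/null: discard the body,
    # -w: print the status code and timings after the transfer.
    result=$(curl -s -o /dev/null --connect-timeout 10 --max-time 45 \
      -w '%{http_code} connect=%{time_connect}s total=%{time_total}s' "$url") && rc=0 || rc=$?
    if [ "$rc" -eq 0 ]; then
      ok=$((ok + 1))
      echo "attempt $i: ok $result"
    else
      # curl exit code 28 means the operation timed out.
      failed=$((failed + 1))
      echo "attempt $i: failed (curl exit $rc)"
    fi
    i=$((i + 1))
  done
  echo "summary: ok=$ok failed=$failed"
)
```

For example, `probe_endpoint https://<external-endpoint> 50` run from a pod in each cluster gives a failure count you can compare directly between the affected cluster and the working one.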
Step 2: Compare Traffic Flow From a VM in the Same VNet
Spin up a small test VM in the same subnet/VNet and run:
- curl -v https://<external-endpoint>
- mtr or traceroute to <external-endpoint>
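ICMP-based traces can be filtered or take a different path than the failing HTTPS traffic, so it may help to trace with TCP probes on port 443 instead. The sketch below assumes a Linux test VM with either mtr or the standard Linux traceroute installed. The helper name trace_https_path and the cycle count of 50 are illustrative choices.

```shell
# trace_https_path HOST
# Hypothetical helper: traces the route to HOST using TCP probes on port 443,
# preferring mtr and falling back to traceroute when mtr is not installed.
trace_https_path() {
  if command -v mtr >/dev/null 2>&1; then
    # --report prints a per-hop loss/latency table after --report-cycles probes.
    mtr --no-dns --report --report-cycles 50 --tcp --port 443 "$1"
  else
    # -T selects TCP SYN probes (typically needs root), -p sets the destination port.
    traceroute -n -T -p 443 "$1"
  fi
}
```

Running `trace_https_path <external-endpoint>` on the test VM and on an AKS node (or a debug pod with host networking) lets you compare where loss or latency first appears.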
Step 3: Use Azure Network Watcher Tools
- IP flow verify: checks whether traffic is allowed or blocked.
- Connection troubleshoot: shows the full hops, packet behavior, and delay points.
- Packet capture (pcap, at VM or VNet level): used when deep inspection is needed; this is the most commonly used option.
- NSG flow logs and insights: track whether flow resets or drops are occurring.
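The first two checks can also be run from the Azure CLI. The following is a sketch, assuming the test VM from Step 2 is the source and has the Network Watcher agent extension installed (connection troubleshoot requires it). The wrapper names and all argument values are placeholders.

```shell
# Hypothetical wrappers around Azure CLI Network Watcher commands.
# Resource group, VM name, and IP:port values are supplied by the caller.

# check_ip_flow RG VM LOCAL_IP:PORT REMOTE_IP:PORT
# Asks whether outbound TCP traffic from the VM is allowed by the effective NSG rules.
check_ip_flow() {
  az network watcher test-ip-flow \
    --resource-group "$1" --vm "$2" \
    --direction Outbound --protocol TCP \
    --local "$3" --remote "$4"
}

# check_connectivity RG VM DEST_HOST
# Runs a connection troubleshoot test from the VM to DEST_HOST on port 443.
check_connectivity() {
  az network watcher test-connectivity \
    --resource-group "$1" --source-resource "$2" \
    --dest-address "$3" --dest-port 443
}
```

For example: `check_ip_flow <rg> <test-vm> <vm-private-ip>:50000 <endpoint-ip>:443`, followed by `check_connectivity <rg> <test-vm> <external-endpoint>`.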
Step 4: Check NAT Gateway Behavior
- Raise idle timeout to 30 minutes (already verified)
- Validate that SNAT ports are not being exhausted or reused too quickly
- Validate that there is no asymmetric routing
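The NAT Gateway checks above can be sketched with the Azure CLI as follows. The helper names are hypothetical, and the metric names SNATConnectionCount and PacketDropCount are what I believe NAT Gateway exposes; please confirm them with `az monitor metrics list-definitions --resource <nat-gateway-id>` before relying on them.

```shell
# nat_idle_timeout RG NATGW_NAME
# Prints the configured TCP idle timeout of the NAT gateway, in minutes.
nat_idle_timeout() {
  az network nat gateway show --resource-group "$1" --name "$2" \
    --query idleTimeoutInMinutes --output tsv
}

# nat_snat_metrics NATGW_RESOURCE_ID
# Lists per-minute SNAT connection counts and dropped packets for the NAT gateway.
# Metric names are an assumption; verify them with "az monitor metrics list-definitions".
nat_snat_metrics() {
  az monitor metrics list --resource "$1" \
    --metrics SNATConnectionCount PacketDropCount \
    --interval PT1M --output table
}
```

If failed SNAT connections or dropped packets line up in time with the observed timeouts, that points toward the NAT Gateway rather than the cluster datapath.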
Step 5: MTU / Fragmentation Testing
Commands to test:
ping <endpoint-ip> -M do -s 1350
If failures begin only at larger payload sizes, an MTU mismatch is likely.
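Note that -s sets the ICMP payload only; the IPv4 packet on the wire is 28 bytes larger (20-byte IP header plus 8-byte ICMP header), so -s 1350 tests a 1378-byte packet and -s 1472 corresponds to a 1500-byte MTU. Rather than trying sizes by hand, you could binary-search the largest payload that passes with the Don't Fragment bit set. This is a sketch that assumes Linux iputils ping (busybox ping does not support -M) and that the endpoint answers ICMP echo; if ICMP is blocked, the result is inconclusive. The helper name and default bounds are illustrative.

```shell
# max_df_payload HOST [LOW] [HIGH]
# Hypothetical helper: binary-searches the largest ICMP payload (ping -s) that
# reaches HOST with the Don't Fragment bit set (-M do), then prints it together
# with the matching IPv4 packet size (payload + 20 IP + 8 ICMP header bytes).
max_df_payload() (
  host=$1
  low=${2:-1200}    # lower bound, expected to pass
  high=${3:-1472}   # 1472 + 28 = 1500, the usual Ethernet MTU
  # Three probes per size so a single lost packet does not skew the search.
  if ! ping -c 3 -W 2 -M do -s "$low" "$host" >/dev/null 2>&1; then
    echo "payload $low already fails; ICMP may be blocked or the path MTU is lower"
    exit 1
  fi
  while [ "$low" -lt "$high" ]; do
    mid=$(( (low + high + 1) / 2 ))
    if ping -c 3 -W 2 -M do -s "$mid" "$host" >/dev/null 2>&1; then
      low=$mid          # mid passes, search above it
    else
      high=$((mid - 1)) # mid fails, search below it
    fi
  done
  echo "largest passing payload: $low bytes (IPv4 packet size $((low + 28)))"
)
```

Run `max_df_payload <endpoint-ip>` from a node or pod in both clusters; a smaller result on the affected cluster would support the MTU theory.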
Step 6: Compare with a Known-Good Cluster
Since a second AKS cluster works fine, compare:
- Node pool OS versions
- Underlying VNet/subnet routing tables
- NAT Gateway associations
- UDRs
- Network policies
This comparison has solved multiple internal cases.
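One way to start that comparison is to diff the network profile that AKS reports for each cluster. This is a minimal sketch using `az aks show`; the helper name diff_aks_network is hypothetical, and it covers only the cluster's networkProfile, so route tables, UDRs, and NAT Gateway associations on the subnets still need to be compared separately.

```shell
# diff_aks_network RG_A CLUSTER_A RG_B CLUSTER_B
# Hypothetical helper: dumps the network profile of two AKS clusters and diffs them.
# Runs in a subshell so the temp directory cleanup trap stays local to this call.
diff_aks_network() (
  tmp=$(mktemp -d)
  trap 'rm -rf "$tmp"' EXIT
  az aks show --resource-group "$1" --name "$2" --query networkProfile --output json > "$tmp/a.json"
  az aks show --resource-group "$3" --name "$4" --query networkProfile --output json > "$tmp/b.json"
  # diff exits non-zero when the files differ, which is the interesting case here.
  diff -u "$tmp/a.json" "$tmp/b.json" && echo "network profiles are identical"
)
```

Some differences are expected (for example, outbound IP resource IDs); look for differences in plugin, plugin mode, outbound type, or pod and service CIDRs.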
Reference links
https://learn.microsoft.com/en-us/azure/aks/aks-diagnostics#example-scenario-diagnose-connectivity-issues
https://github.com/MicrosoftDocs/SupportArticles-docs/blob/main/support/azure/azure-kubernetes/connectivity/basic-troubleshooting-outbound-connections.md
https://learn.microsoft.com/en-us/azure/architecture/operator-guides/aks/troubleshoot-network-aks
https://docs.azure.cn/en-us/aks/aks-diagnostics
Thanks,
Manish Deshpande