AKS outbound long-running request hanging

Mateusz T 0 Reputation points
2023-02-01T17:01:58.1166667+00:00

Cluster A exposes the api.example.com endpoint, which proxies traffic to a custom pod in cluster B (exposed as v1.api.example.com). This custom pod has 2 endpoints: one (let's call it B1) responding after 10 seconds and a second one responding after 5 minutes (let's call it B2). Proxying is configured in cluster A using a Kubernetes Ingress pointing at an ExternalName service whose externalName is set to the cluster B domain.
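
For clarity, the proxy configuration in cluster A looks roughly like this (a simplified sketch - the resource names, the backend-protocol/upstream-vhost annotations and the HTTPS port are illustrative, not the exact manifests):

    apiVersion: v1
    kind: Service
    metadata:
      name: cluster-b-proxy
    spec:
      # ExternalName service resolving to the cluster B domain
      type: ExternalName
      externalName: v1.api.example.com
    ---
    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: api-proxy
      annotations:
        # proxy to cluster B over HTTPS and pass its hostname upstream
        nginx.ingress.kubernetes.io/backend-protocol: "HTTPS"
        nginx.ingress.kubernetes.io/upstream-vhost: v1.api.example.com
    spec:
      ingressClassName: nginx
      rules:
      - host: api.example.com
        http:
          paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: cluster-b-proxy
                port:
                  number: 443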

Let's consider following scenarios:

  1. If I send a request (from Postman/curl) directly to cluster B (on v1.api.example.com), I get a response from both endpoints B1 and B2 after the expected delay.
  2. If I send the request through cluster A (using api.example.com), the request is proxied to cluster B, but the request to B1 succeeds while the request to B2 hangs from the client's perspective and returns no result (even after far longer than 5 minutes, and with no HTTP error - I would expect something like 502/504; only when the client cancels the request does HTTP 499 appear in cluster A's logs). In cluster B's logs (both ingress-nginx and the service itself) I can see that the result was returned as HTTP 200 after the expected time for both B1 and B2, but it seems it was never consumed by ingress-nginx in cluster A, even though the connection appears to stay open the whole time.
  3. Moreover, if I deploy a pod in cluster A with similar services - one responding after 10 seconds and the other after 5 minutes - both work properly when reached through cluster A.
  4. If I exec into a pod in cluster A and hit the B2 request directly, it returns the response properly after 5 minutes. If I hit cluster A's ingress-nginx from that pod using its internal IP (and setting the required Host header), the B2 request also hangs indefinitely, while B1 does not, so the B1/B2 services are reachable from there (see the curl sketch below).
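
For reference, the in-cluster tests from point 4 look roughly like this (the ingress internal IP and the /b2 path are placeholders, not the real ones):

    # Hit the cluster B endpoint directly from a pod in cluster A - responds after ~5 minutes
    curl -v --max-time 600 https://v1.api.example.com/b2

    # Hit cluster A's own ingress-nginx via its internal IP with the Host header set - this one hangs
    curl -v --max-time 600 -H "Host: api.example.com" http://<ingress-internal-ip>/b2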

My conclusion is that this is related to some setting affecting outbound traffic from cluster A to cluster B (or to the internet in general). I was reading the documentation about AKS outbound connectivity and the load balancer idle timeout, but I haven't found the cause so far.

AKS setup:

  • Kubernetes version 1.24.6
  • Azure Load Balancer with public ingress and egress IPs. The Load Balancer idle timeout is set to 25 minutes on both the frontend and the outbound configuration, but it seems strange to me that the automatically created aksOutboundRule doesn't allow choosing the kubernetes backend pool, which is greyed out (when creating a different outbound rule the behavior is the same), so I'm wondering whether the 25-minute timeout is actually applied to the outbound traffic - see the az commands after this list (screenshot of the outbound rule attached).
  • Ingress-nginx as the ingress controller. I have tried to set all timeout-related settings in cluster A to high values to make sure they are not silently shutting down the underlying TCP connection:
       keep-alive: "420"
       proxy-connect-timeout: "60"
       proxy-read-timeout: "1800"
       proxy-send-timeout: "1800"
       upstream-keepalive-timeout: "420"
  • Network profile extracted using az aks show:
     "networkProfile": {
       "dnsServiceIp": "10.0.0.10",
       "dockerBridgeCidr": "172.17.0.1/16",
       "ipFamilies": [
         "IPv4"
       ],
       "loadBalancerProfile": {
         "allocatedOutboundPorts": 0,
         "effectiveOutboundIPs": [
           {
             "id": "<id>",
             "resourceGroup": "<rg-id>"
           }
         ],
         "enableMultipleStandardLoadBalancers": null,
         "idleTimeoutInMinutes": 25,
         "managedOutboundIPs": {
           "count": 1,
           "countIpv6": null
         },
         "outboundIPs": null,
         "outboundIpPrefixes": null
       },
       "loadBalancerSku": "Standard",
       "natGatewayProfile": null,
       "networkMode": null,
       "networkPlugin": "azure",
       "networkPolicy": "calico",
       "outboundType": "loadBalancer",
       "podCidr": null,
       "podCidrs": null,
       "serviceCidr": "10.0.0.0/16",
       "serviceCidrs": [
         "10.0.0.0/16"
       ]
     },
  • The only Kubernetes NetworkPolicy in the cluster (also autogenerated):
   apiVersion: networking.k8s.io/v1
   kind: NetworkPolicy
   metadata:
     annotations:
     generation: 1
     labels:
       addonmanager.kubernetes.io/mode: Reconcile
     name: konnectivity-agent
     namespace: kube-system
     resourceVersion: "598"
   spec:
     egress:
     - {}
     podSelector:
       matchLabels:
         app: konnectivity-agent
     policyTypes:
     - Egress
   status: {}
  • Clusters A and B are deployed in separate Virtual Networks, and each of them exposes its services on a separate domain.
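
To illustrate the doubt about the outbound rule above, this is roughly how I check whether the 25-minute idle timeout really landed on the outbound side (assuming the default AKS naming, i.e. the managed load balancer is called kubernetes and lives in the MC_* node resource group):

    # Idle timeout configured on the automatically created outbound rule
    az network lb outbound-rule show \
      --resource-group MC_<rg>_<cluster>_<region> \
      --lb-name kubernetes \
      --name aksOutboundRule \
      --query "{idleTimeoutInMinutes: idleTimeoutInMinutes, enableTcpReset: enableTcpReset}"

    # Idle timeout as reported by the AKS load balancer profile
    az aks show --resource-group <rg> --name <cluster> \
      --query networkProfile.loadBalancerProfile.idleTimeoutInMinutes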

Can anyone please help me figure out what could cause the long-running requests to hang when traffic goes through cluster A to cluster B?
