Prevent SNAT port exhaustion in AKS standard load balancer?

Tanul 1,251 Reputation points
2023-05-30T18:26:03.82+00:00

Hello,

We need to modify the outbound rule of a standard load balancer (we cannot use a NAT gateway). What are the best practices for setting the values of:

  1. TCP Idle timeout
  2. Choice between maximum number of backend instances OR ports per instance

The second one can still be calculated, or modified gradually as we scale up the nodes, but we don't know how to choose or calculate the first one.

Any suggestions, please?

Kind Regards,

Tanul

Azure Kubernetes Service (AKS)
An Azure service that provides serverless Kubernetes, an integrated continuous integration and continuous delivery experience, and enterprise-grade security and governance.
Azure Load Balancer
An Azure service that delivers high availability and network performance to applications.

2 answers

  1. Eddie Neto 1,236 Reputation points Microsoft Employee
    2023-05-31T15:25:15.9866667+00:00

    Hi @Tanul

    Thanks for reaching out on Microsoft Q&A.

    Regarding your question:

    Azure Load Balancer has the following idle timeout range:

    • 4 minutes to 100 minutes for Outbound Rules
    • 4 minutes to 30 minutes for Load Balancer rules and Inbound NAT rules

    By default, it's set to 4 minutes. If a period of inactivity is longer than the timeout value, there's no guarantee that the TCP or HTTP session is maintained between the client and your cloud service.

    When the connection is closed, your client application may receive the following error message: "The underlying connection was closed: A connection that was expected to be kept alive was closed by the server."

    A common practice is to use a TCP keep-alive. This practice keeps the connection active for a longer period. With keep-alive enabled, packets are sent during periods of inactivity on the connection. Keep-alive packets ensure the idle timeout value isn't reached and the connection is maintained for a long period.

    The idle timeout configured on load balancer rules works for inbound connections only; outbound connections are governed by the idle timeout on the outbound rule. To avoid losing the connection, configure the TCP keep-alive with an interval shorter than the idle timeout setting, or increase the idle timeout value. Configurable idle timeouts were added to support these scenarios.
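
As a rough illustration of the keep-alive advice above, here is a minimal Python sketch that enables TCP keep-alive on a client socket and starts probing at half the load balancer idle timeout. The function name `enable_keepalive` and the "half the timeout" rule are assumptions made for this example, not an Azure recommendation. The `TCP_KEEPIDLE`, `TCP_KEEPINTVL`, and `TCP_KEEPCNT` options are platform-specific (available on Linux), so the sketch only sets them where the `socket` module exposes them.

```python
import socket


def enable_keepalive(sock, idle_timeout_minutes=4, probe_interval_s=30, probe_count=3):
    """Enable TCP keep-alive so probes start before the LB idle timeout expires.

    idle_timeout_minutes is the idle timeout configured on the load balancer
    (4 minutes by default). Starting probes at half that value is an
    assumption chosen for this sketch.
    """
    idle_s = max(1, (idle_timeout_minutes * 60) // 2)

    # Turn keep-alive on for this socket (portable option).
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Fine-grained timers are platform-specific, so guard each one.
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Seconds of inactivity before the first probe is sent.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_s)
    if hasattr(socket, "TCP_KEEPINTVL"):
        # Seconds between subsequent probes.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, probe_interval_s)
    if hasattr(socket, "TCP_KEEPCNT"):
        # Unanswered probes before the connection is considered dead.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probe_count)

    return idle_s
```

Many HTTP and database client libraries expose their own keep-alive settings, so in practice it may be simpler to configure those than to touch sockets directly.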

    See our documentation: https://learn.microsoft.com/en-us/azure/aks/load-balancer-standard#configure-the-load-balancer-idle-timeout

    The following examples show how the number of outbound ports and IP addresses are affected by the values you set:

    • If the default values are used and the cluster has 48 nodes, each node will have 1024 ports available.
    • If the default values are used and the cluster scales from 48 to 52 nodes, each node will be updated from 1024 ports available to 512 ports available.
    • If the number of outbound ports is set to 1,000 and the outbound IP count is set to 2, then the cluster can support a maximum of 128 nodes: 64,000 ports per IP / 1,000 ports per node * 2 IPs = 128 nodes.
    • If the number of outbound ports is set to 1,000 and the outbound IP count is set to 7, then the cluster can support a maximum of 448 nodes: 64,000 ports per IP / 1,000 ports per node * 7 IPs = 448 nodes.
    • If the number of outbound ports is set to 4,000 and the outbound IP count is set to 2, then the cluster can support a maximum of 32 nodes: 64,000 ports per IP / 4,000 ports per node * 2 IPs = 32 nodes.
    • If the number of outbound ports is set to 4,000 and the outbound IP count is set to 7, then the cluster can support a maximum of 112 nodes: 64,000 ports per IP / 4,000 ports per node * 7 IPs = 112 nodes.
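
The arithmetic in the examples above can be sketched in a few lines of Python. The function names `max_nodes` and `required_ips` are hypothetical helpers for this illustration. `required_ips` is simply the same formula rearranged to answer "how many outbound IPs do I need for N nodes?"; it is an assumption that this rearrangement matches how you plan capacity, not something the documentation states in this form.

```python
import math

# Usable SNAT ports per outbound public IP, as used in the examples above.
PORTS_PER_IP = 64_000


def max_nodes(ports_per_node, outbound_ip_count):
    """Maximum node count for a fixed per-node port allocation."""
    return (PORTS_PER_IP * outbound_ip_count) // ports_per_node


def required_ips(node_count, ports_per_node):
    """Outbound IPs needed so every node gets ports_per_node ports."""
    return math.ceil(node_count * ports_per_node / PORTS_PER_IP)


if __name__ == "__main__":
    # Reproduce the four worked examples from the list above.
    for ports, ips in [(1000, 2), (1000, 7), (4000, 2), (4000, 7)]:
        print(f"{ports} ports/node, {ips} IPs -> up to {max_nodes(ports, ips)} nodes")
```

For instance, `required_ips(52, 4000)` gives 4, because 52 nodes at 4,000 ports each need 208,000 ports, which is more than three IPs can supply.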

    Hope this helps. Please "Accept as Answer" if it helped, so that it can help others in the community looking for help on similar topics.


  2. Alessandro AFFINITO (PTV Group) 0 Reputation points
    2024-01-17T16:22:47.7666667+00:00

    Have you tried using a virtual network NAT?

    https://www.danielstechblog.io/preventing-snat-port-exhaustion-on-azure-kubernetes-service-with-virtual-network-nat/

    EDIT: it still requires a NAT GW under the hood
