TCP time-outs when kubectl or other third-party tools connect to the API server
This article discusses how to troubleshoot TCP time-outs that occur when kubectl or other third-party tools are used to connect to the API server in Microsoft Azure Kubernetes Service (AKS). To ensure its service-level objectives (SLOs) and service-level agreements (SLAs), AKS uses high-availability (HA) control planes that scale vertically and horizontally, based on the number of cores.
Symptoms
You experience repeated connection time-outs.
Cause 1: Pods that are responsible for node-to-control plane communication aren't running
If only a few of your API commands are timing out consistently, the following pods might not be in a running state:

- `konnectivity-agent`
- `tunnelfront`
- `aks-link`
Note

In newer AKS versions, `tunnelfront` and `aks-link` are replaced with `konnectivity-agent`, so you'll only see `konnectivity-agent`.
These pods are responsible for communication between a node and the control plane.
Solution: Reduce the utilization or stress of the node hosts
Make sure the nodes that host these pods aren't overly utilized or under stress. Consider moving the nodes to their own system node pool.
To check which node the `konnectivity-agent` pod is hosted on and the usage of that node, run the following commands:
# Check which node the konnectivity-agent pod is hosted on
$ kubectl get pod -n kube-system -o wide
# Check the usage of the node hosting the pod
$ kubectl top node
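If the hosting node stays under pressure, one option is to give system pods a dedicated system node pool. The following is a minimal sketch using the Azure CLI, assuming placeholder resource group, cluster, and node pool names; adjust the VM size and node count to your workload.
# Add a dedicated system-mode node pool (placeholder names)
$ az aks nodepool add \
    --resource-group <resource-group> \
    --cluster-name <cluster-name> \
    --name systempool \
    --mode System \
    --node-count 3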
Cause 2: Access is blocked on some required ports, FQDNs, and IP addresses
If the required ports, fully qualified domain names (FQDNs), and IP addresses aren't all opened, several command calls might fail. Secure, tunneled communication on AKS between the API server and the kubelet (through the `konnectivity-agent` pod) requires some of those items to work successfully.
Solution: Open the necessary ports, FQDNs, and IP addresses
For more information about what ports, FQDNs, and IP addresses need to be opened, see Outbound network and FQDN rules for Azure Kubernetes Service (AKS) clusters.
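As a quick sanity check (not a substitute for allowing everything in the linked article), you can confirm that the API server FQDN at least resolves from inside the cluster. This sketch assumes placeholder resource group and cluster names and uses a temporary busybox pod.
# Get the API server FQDN for the cluster
$ az aks show --resource-group <resource-group> --name <cluster-name> --query fqdn -o tsv
# From a temporary pod, verify that the FQDN returned above resolves inside the cluster
$ kubectl run dns-test -it --rm --restart=Never --image=busybox:1.36 -- nslookup <api-server-fqdn>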
Cause 3: The Application-Layer Protocol Negotiation TLS extension is blocked
To establish a connection between the control plane and nodes, the `konnectivity-agent` pod requires the Transport Layer Security (TLS) extension for Application-Layer Protocol Negotiation (ALPN). You might have previously blocked this extension.
Solution: Enable the ALPN extension
Enable the ALPN extension on the `konnectivity-agent` pod to prevent TCP time-outs.
Cause 4: The API server's authorized IP ranges don't cover your current IP address
If you use authorized IP address ranges on your API server, your API calls will be blocked if your IP isn't included in the authorized ranges.
Solution: Modify the authorized IP address ranges so that they cover your IP address
Change the authorized IP address ranges so that your IP address is covered. For more information, see Update a cluster's API server authorized IP ranges.
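For example, assuming placeholder resource group and cluster names, you can look up your current public IP address and add it to the authorized ranges with the Azure CLI. Note that the update replaces the whole list, so include every range you still need.
# Find your current public IP address (any IP-echo service works; ifconfig.me is just an example)
$ curl -s ifconfig.me
# Add that address as a /32 alongside the ranges you want to keep
$ az aks update \
    --resource-group <resource-group> \
    --name <cluster-name> \
    --api-server-authorized-ip-ranges <existing-ranges>,<your-public-ip>/32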
Cause 5: A client or application leaks calls to the API server
Frequent GET calls can accumulate and overload the API server.
Solution: Use watches instead of GET calls, but make sure the application doesn't leak those calls
Make sure that you use watches instead of frequent GET calls to the API server, and make sure that your third-party applications don't leak watch connections or GET calls. For example, in the Istio microservice architecture, a bug in the mixer application creates a new API server watch connection whenever a secret is read internally. Because this behavior happens at a regular interval, the watch connections quickly accumulate and eventually overload the API server regardless of the scaling pattern.
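For example, instead of polling the API server in a loop, use the built-in watch support in kubectl (or the equivalent watch APIs in your client library), and close long-running watch connections when they're no longer needed. The namespace below is a placeholder.
# Polling in a tight loop generates a steady stream of GET calls
$ while true; do kubectl get pods -n <namespace>; sleep 1; done
# A single watch connection streams changes instead
$ kubectl get pods -n <namespace> --watch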
Cause 6: Too many releases in your Helm deployments
If you use too many releases in your deployments of Helm (the Kubernetes package manager), the nodes start to consume too much memory. It also results in a large number of `ConfigMap` (configuration data) objects, which might cause unnecessary usage spikes on the API server.
Solution: Limit the maximum number of revisions for each release
Because the maximum number of revisions for each release is infinite by default, you need to run a command to set this maximum number to a reasonable value. For Helm 2, the command is `helm init`. For Helm 3, the command is `helm upgrade`. Set the `--history-max <value>` parameter when you run the command.
| Version | Command |
|---|---|
| Helm 2 | `helm init --history-max <maximum-number-of-revisions-per-release> ...` |
| Helm 3 | `helm upgrade ... --history-max <maximum-number-of-revisions-per-release> ...` |
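For example, with Helm 3 and hypothetical release and chart names, the following commands keep only the 10 most recent revisions of a release and then show how many revisions are currently stored:
# Keep at most 10 revisions of this release (hypothetical release and chart names)
$ helm upgrade my-release ./my-chart --history-max 10
# Verify how many revisions are currently stored
$ helm history my-release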
Cause 7: Internal traffic between nodes is being blocked
There might be internal traffic blockages between nodes in your AKS cluster.
Solution: Troubleshoot the "dial tcp <Node_IP>:10250: i/o timeout" error
See Troubleshoot TCP timeouts, such as "dial tcp <Node_IP>:10250: i/o timeout".
Cause 8: Your cluster is private
Your cluster is a private cluster, but the client from which you're trying to access the API server is in a public or different network that can't connect to the subnet used by AKS.
Solution: Use a client that can access the AKS subnet
Because your cluster is private and its control plane is in the AKS subnet, the API server can't be reached from a client unless that client is in a network that can connect to the AKS subnet. This is expected behavior.
In this case, try to access the API server from a client in a network that can communicate with the AKS subnet. Additionally, verify network security groups (NSGs) or other appliances between networks aren't blocking packets.
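If you can't place a client in a connected network, one option is to run commands through the AKS `command invoke` feature, which doesn't require direct network access to the private API server. This sketch assumes placeholder resource group and cluster names.
# Run a kubectl command against a private cluster without direct network access to the API server
$ az aks command invoke \
    --resource-group <resource-group> \
    --name <cluster-name> \
    --command "kubectl get pods -n kube-system"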
Third-party information disclaimer
The third-party products that this article discusses are manufactured by companies that are independent of Microsoft. Microsoft makes no warranty, implied or otherwise, about the performance or reliability of these products.
Contact us for help
If you have questions or need help, create a support request, or ask Azure community support. You can also submit product feedback to Azure feedback community.