Troubleshoot network problems in AKS clusters

Network problems can occur in new installations of Kubernetes or when you increase the Kubernetes load. Other problems that trace back to networking issues might also occur. Always check the AKS troubleshooting guide to see whether your problem is described there. This article adds details and considerations from a network troubleshooting perspective and covers specific problems that might arise.

Client can't reach the API server

These errors involve connection problems that occur when you can't reach an Azure Kubernetes Service (AKS) cluster's API server through the Kubernetes cluster command-line tool (kubectl) or any other tool, like the REST API via a programming language.

Error

You might see errors that look like these:

Unable to connect to the server: dial tcp <API-server-IP>:443: i/o timeout 
Unable to connect to the server: dial tcp <API-server-IP>:443: connectex: A connection attempt
failed because the connected party did not properly respond after a period, or established 
connection failed because connected host has failed to respond. 

Cause 1

It's possible that API server authorized IP ranges are enabled on the cluster, but the client's IP address isn't included in those ranges. To determine whether authorized IP ranges are enabled, use the following az aks show command in Azure CLI. If the ranges are enabled, the command produces a list of them.

az aks show --resource-group <cluster-resource-group> \ 
    --name <cluster-name> \ 
    --query apiServerAccessProfile.authorizedIpRanges 

Solution 1

Ensure that your client's IP address is within the ranges authorized by the cluster's API server:

  1. Find your local IP address. For information on how to find it on Windows and Linux, see How to find my IP.

  2. Update the range that's authorized by the API server by using the az aks update command in Azure CLI. Authorize your client's IP address. For instructions, see Update a cluster's API server authorized IP ranges.
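For example, the following is a sketch of those two steps. The resource names and IP ranges are placeholders that you must replace with your own values, and any "what is my IP" service can stand in for ifconfig.me:

# Find your current public IPv4 address (example service; use any equivalent).
curl -4 ifconfig.me

# Authorize your client IP alongside the existing ranges. This command replaces the
# current list, so include every range that should remain authorized.
az aks update --resource-group <cluster-resource-group> \
    --name <cluster-name> \
    --api-server-authorized-ip-ranges <existing-range>,<your-client-ip>/32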

Cause 2

If your AKS cluster is a private cluster, the API server endpoint doesn't have a public IP address. You need to use a VM that has network access to the AKS cluster's virtual network.

Solution 2

For information on how to resolve this problem, see options for connecting to a private cluster.
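If you only need to run a few commands and don't want to provision a jump box, AKS also supports running kubectl through the managed command invoke channel. The following is a minimal sketch with placeholder resource names:

# Run a kubectl command inside the private cluster's network without direct connectivity.
az aks command invoke --resource-group <cluster-resource-group> \
    --name <cluster-name> \
    --command "kubectl get nodes"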

Pod fails to allocate the IP address

Error

The Pod is stuck in the ContainerCreating state, and its events report a Failed to allocate address error:

Normal   SandboxChanged          5m (x74 over 8m)    kubelet, k8s-agentpool-00011101-0 Pod sandbox
changed, it will be killed and re-created. 

  Warning  FailedCreatePodSandBox  21s (x204 over 8m)  kubelet, k8s-agentpool-00011101-0 Failed 
create pod sandbox: rpc error: code = Unknown desc = NetworkPlugin cni failed to set up pod 
"deployment-azuredisk6-874857994-487td_default" network: Failed to allocate address: Failed to 
delegate: Failed to allocate address: No available addresses 

Or a not enough IPs available error:

Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox 
'ac1b1354613465324654c1588ac64f1a756aa32f14732246ac4132133ba21364': plugin type='azure-vnet' 
failed (add): IPAM Invoker Add failed with error: Failed to get IP address from CNS with error: 
%w: AllocateIPConfig failed: not enough IPs available for 9c6a7f37-dd43-4f7c-a01f-1ff41653609c, 
waiting on Azure CNS to allocate more with NC Status: , IP config request is [IPConfigRequest: 
DesiredIPAddress , PodInterfaceID a1876957-eth0, InfraContainerID 
a1231464635654a123646565456cc146841c1313546a515432161a45a5316541, OrchestratorContext 
{'PodName':'a_podname','PodNamespace':'my_namespace'}]

Check the allocated IP addresses in the plugin IPAM store. You might find that all IP addresses are allocated, but the number is much greater than the number of running Pods:

If using kubenet:

# Kubenet, for example. The actual path of the IPAM store file depends on network plugin implementation. 
chroot /host/
ls -la "/var/lib/cni/networks/$(ls /var/lib/cni/networks/ | grep -e "k8s-pod-network" -e "kubenet")" | grep -v -e "lock\|last\|total" -e '\.$' | wc -l
244

Note

For kubenet without Calico, the path is /var/lib/cni/networks/kubenet. For kubenet with Calico, the path is /var/lib/cni/networks/k8s-pod-network. The command above automatically selects the correct path when it runs.

# Check running Pod IPs
kubectl get pods --field-selector spec.nodeName=<your_node_name>,status.phase=Running -A -o json | jq -r '.items[] | select(.spec.hostNetwork != 'true').status.podIP' | wc -l
7 

If using Azure CNI for dynamic IP allocation:

kubectl get nnc -n kube-system -o wide
NAME                               REQUESTED IPS  ALLOCATED IPS  SUBNET  SUBNET CIDR   NC ID                                 NC MODE  NC TYPE  NC VERSION
aks-agentpool-12345678-vmss000000  32             32             subnet  10.18.0.0/15  559e239d-f744-4f84-bbe0-c7c6fd12ec17  dynamic  vnet     1
# Check running Pod IPs
kubectl get pods --field-selector spec.nodeName=aks-agentpool-12345678-vmss000000,status.phase=Running -A -o json | jq -r '.items[] | select(.spec.hostNetwork != 'true').status.podIP' | wc -l
21

Cause 1

This error can be caused by a bug in the network plugin. The plugin can fail to deallocate the IP address when a Pod is terminated.

Solution 1

Contact Microsoft for a workaround or fix.

Cause 2

Pod creation is much faster than garbage collection of terminated Pods.

Solution 2

Configure fast garbage collection for the kubelet. For instructions, see the Kubernetes garbage collection documentation.
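As an illustrative sketch only, container garbage collection is tuned through kubelet settings such as the following; the values shown are examples, and how you apply kubelet settings depends on how your node pools are provisioned:

# Example values only: keep at most one dead container per container and 100 per node,
# and allow collection 30 seconds after a container finishes.
kubelet --minimum-container-ttl-duration=30s \
    --maximum-dead-containers-per-container=1 \
    --maximum-dead-containers=100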

Service not accessible within Pods

The first step to resolving this problem is to check whether endpoints have been created automatically for the service:

kubectl get endpoints <service-name> 

If you get an empty result, your service's label selector might be wrong. Confirm that the label is correct:

# Query Service LabelSelector. 
kubectl get svc <service-name> -o jsonpath='{.spec.selector}' 

# Get Pods matching the LabelSelector and check whether they're running. 
kubectl get pods -l key1=value1,key2=value2 

If the preceding steps return expected values:

  • Check whether the Pod's containerPort is the same as the service's targetPort.

  • Check whether podIP:containerPort is working:

    # Testing via cURL. 
    curl -v telnet://<Pod-IP>:<containerPort>
    
    # Testing via Telnet. 
    telnet <Pod-IP> <containerPort>
    

These are some other potential causes of service problems:

  • The container isn't listening to the specified containerPort. (Check the Pod description.)
  • A CNI plugin error or network route error is occurring.
  • kube-proxy isn't running or iptables rules aren't configured correctly.
  • A network policy is dropping traffic. For information on applying and testing network policies, see Azure Kubernetes Network Policies overview.
    • If you're using Calico as your network plugin, you can capture network policy traffic as well. For information on configuring that, see the Calico site.
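To investigate the first and third items, the following commands are a starting point; the kube-proxy label selector is an assumption and might differ on your cluster:

# Confirm the container exposes and listens on the expected containerPort.
kubectl describe pod <pod-name>

# Verify that a kube-proxy Pod is running on every node (label selector may vary).
kubectl get pods -n kube-system -l component=kube-proxy -o wide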

Nodes can't reach the API server

Many add-ons and containers need to access the Kubernetes API (for example, kube-dns and operator containers). If errors occur during this process, the following steps can help you determine the source of the problem.

First, confirm whether the Kubernetes API is accessible within Pods:

kubectl run curl --image=mcr.microsoft.com/azure-cli -i -t --restart=Never --overrides='[{"op":"add","path":"/spec/containers/0/resources","value":{"limits":{"cpu":"200m","memory":"128Mi"}}}]' --override-type json --command -- sh

Then run the following commands from within the container shell that opens.

# If you don't see a command prompt, try pressing Enter. 
KUBE_TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token) 
curl -sSk -H "Authorization: Bearer $KUBE_TOKEN" https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT/api/v1/namespaces/default/pods

Healthy output will look similar to the following.

{ 
  "kind": "PodList", 
  "apiVersion": "v1", 
  "metadata": { 
    "selfLink": "/api/v1/namespaces/default/pods", 
    "resourceVersion": "2285" 
  }, 
  "items": [ 
   ... 
  ] 
} 

If an error occurs, check whether the kubernetes-internal service and its endpoints are healthy:

kubectl get service kubernetes-internal
NAME                TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE 
kubernetes-internal ClusterIP   10.96.0.1    <none>        443/TCP   25m 
kubectl get endpoints kubernetes-internal
NAME                ENDPOINTS          AGE 
kubernetes-internal 172.17.0.62:6443   25m 

If both tests return responses like the preceding ones, and the IP and port returned match the ones for your container, it's likely that kube-apiserver isn't running or is blocked from the network.

There are four main reasons why the access might be blocked:

You can also check kube-apiserver logs by using Container insights. For information on querying kube-apiserver logs, and many other queries, see How to query logs from Container insights.

Finally, you can check the kube-apiserver status and its logs on the cluster itself:

# Check kube-apiserver status. 
kubectl -n kube-system get pod -l component=kube-apiserver 

# Get kube-apiserver logs. 
PODNAME=$(kubectl -n kube-system get pod -l component=kube-apiserver -o jsonpath='{.items[0].metadata.name}')
kubectl -n kube-system logs $PODNAME --tail 100

If a 403 - Forbidden error returns, kube-apiserver is probably configured with role-based access control (RBAC) and your container's ServiceAccount probably isn't authorized to access resources. In this case, you should create appropriate RoleBinding and ClusterRoleBinding objects. For information about roles and role bindings, see Access and identity. For examples of how to configure RBAC on your cluster, see Using RBAC Authorization.
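As a sketch, the following commands grant a ServiceAccount read-only access to Pods in one namespace; the role, binding, namespace, and ServiceAccount names are placeholders for illustration:

# Create a Role that can read Pods, then bind it to the ServiceAccount that your container uses.
kubectl create role pod-reader --verb=get --verb=list --verb=watch --resource=pods -n <namespace>
kubectl create rolebinding pod-reader-binding --role=pod-reader \
    --serviceaccount=<namespace>:<serviceaccount-name> -n <namespace>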
