Troubleshoot connection issues to an app that's hosted in an AKS cluster

Стаття
10/08/2024

In current dynamic cloud environments, ensuring seamless connectivity to applications hosted in Azure Kubernetes Service (AKS) clusters is crucial for maintaining optimal performance and user experience. This article covers how to troubleshoot and resolve connectivity issues caused by various factors, including application-side problems, network policies, Network security group (NSG) rules or others.

Note

To troubleshoot common issues when you try to connect to the AKS API server, see Basic troubleshooting of cluster connection issues with the API server.

Prerequisites

The Client URL (cURL) tool, or a similar command-line tool.
The apt-get command-line tool for handling packages.
The Netcat (nc) command-line tool for TCP connections.
The Kubernetes kubectl tool, or a similar tool to connect to the cluster. To install kubectl by using Azure CLI, run the az aks install-cli command.

Factors to consider

This section covers troubleshooting steps to take if you're having issues when you try to connect to the application that's hosted in an AKS cluster.

In any networking scenario, administrators should consider the following important factors when troubleshooting:

What's the source and the destination for a request?
What are the hops in between the source and the destination?
What's the request-response flow?
Which hops have extra security layers on top, such as the following items:
- Firewall
- Network security group (NSG)
- Network policy

When you check each component, get and analyze HTTP response codes. These codes are useful to identify the nature of the issue, and are especially helpful in scenarios in which the application responds to HTTP requests.

If other troubleshooting steps don't provide any conclusive outcome, take packet captures from the client and server. Packet captures are also useful when non-HTTP traffic is involved between the client and server. For more information about how to collect packet captures for AKS environment, see the following articles in the data collection guide:

Knowing how to get the HTTP response codes and take packet captures makes it easier to troubleshoot a network connectivity issue.

Basic network flow for applications on AKS

In general, when applications are exposed using the Azure Load Balancer service type, the request flow to access them is as follows:

Client >> DNS name >> AKS load balancer IP address >> AKS nodes >> Pods

There are other possible situations in which extra components might be involved. For example:

The managed NGINX ingress with the application routing add-on feature is enabled.
The application gateway is used through the Application Gateway Ingress Controller (AGIC) instead of Azure Load Balancer.
Azure Front Door and API Management might be used on top of the load balancer.
The process uses an internal load balancer.
The connection might not end at the pod and the requested URL. This could depend on whether the pod can connect to another entity, such as a database or any other service in the same cluster.

It's important to understand the request flow for the application.

A basic request flow to applications in an AKS cluster would resemble the flow that's shown in the following diagram.

Inside-out troubleshooting

Troubleshooting connectivity issues might involve many checks, but the inside-out approach can help find the source of the issue and identify the bottleneck. In this approach, you start at the pod itself, checking whether the application is responding on the pod's IP address. Then, check each component in turn up to the end client.

Step 1: Check whether the pod is running and the app or container inside the pod is responding correctly

To determine whether the pod is running, run one of the following kubectl get commands:

# List pods in the specified namespace.
kubectl get pods -n <namespace-name>

# List pods in all namespaces.
kubectl get pods -A

What if the pod isn't running? In this case, check the pod events by using the kubectl describe command:

kubectl describe pod <pod-name> -n <namespace-name>

If the pod isn't in a Ready or Running state, or it restarted many times, check the kubectl describe output. The events will reveal any issues that prevent you from being able to start the pod. Or, if the pod has started, the application inside the pod might have failed, causing the pod to be restarted. Troubleshoot the pod accordingly to make sure that it's in a suitable state.

If the pod is running, it can also be useful to check the logs of the containers that are inside the pod. Run the following series of kubectl logs commands:

kubectl logs <pod-name> -n <namespace-name>

# Check logs for an individual container in a multicontainer pod.
kubectl logs <pod-name> -n <namespace-name> -c <container-name>

# Dump pod logs (stdout) for a previous container instance.
kubectl logs <pod-name> --previous                      

# Dump pod container logs (stdout, multicontainer case) for a previous container instance.
kubectl logs <pod-name> -c <container-name> --previous

Is the pod running? In this case, test the connectivity by starting a test pod in the cluster. From the test pod, you can directly access the application's pod IP address and check whether the application is responding correctly. Run the kubectl run, apt-get, and cURL commands as follows:

# Start a test pod in the cluster:
kubectl run -it --rm aks-ssh --image=debian:stable

# After the test pod is running, you will gain access to the pod.
# Then you can run the following commands:
apt-get update -y && apt-get install dnsutils -y && apt-get install curl -y && apt-get install netcat-traditional -y

# After the packages are installed, test the connectivity to the application pod:
curl -Iv http://<pod-ip-address>:<port>

For applications that listen on other protocols, you can install relevant tools inside the test pod like the netcat tool, and then check the connectivity to the application pod by running the following command:

# After the packages are installed, test the connectivity to the application pod using netcat/nc command:
nc -z -v <pod-ip-address> <port>

For more commands to troubleshoot pods, see Debug running pods.

Step 2: Check whether the application is reachable from the service

For scenarios in which the application inside the pod is running, you can focus mainly on troubleshooting how the pod is exposed.

Is the pod exposed as a service? In this case, check the service events. Also, check whether the pod IP address and application port are available as an endpoint in the service description:

# Check the service details.
kubectl get svc -n <namespace-name>

# Describe the service.
kubectl describe svc <service-name> -n <namespace-name>

Check whether the pod's IP address is present as an endpoint in the service, as in the following example:

$ kubectl get pods -o wide  # Check the pod's IP address.
NAME            READY   STATUS        RESTARTS   AGE   IP            NODE                                
my-pod          1/1     Running       0          12m   10.244.0.15   aks-agentpool-000000-vmss000000  

$ kubectl describe service my-cluster-ip-service  # Check the endpoints in the service.
Name:              my-cluster-ip-service
Namespace:         default
Selector:          app=my-pod
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                10.0.174.133
IPs:               10.0.174.133
Port:              <unset>  80/TCP
TargetPort:        80/TCP
Endpoints:         10.244.0.15:80     # <--- Here

$ kubectl get endpoints  # Check the endpoints directly for verification.
NAME                      ENDPOINTS           AGE
my-cluster-ip-service     10.244.0.15:80      14m

If the endpoints aren't pointing to the correct pod IP address, verify the Labels and Selectors for the pod and the service.

Are the endpoints in the service correct? If so, access the service, and check whether the application is reachable.

Access the ClusterIP service

For the ClusterIP service, you can start a test pod in the cluster and access the service IP address:

# Start a test pod in the cluster:
kubectl run -it --rm aks-ssh --image=debian:stable
  
# After the test pod is running, you will gain access to the pod.
# Then, you can run the following commands:
apt-get update -y && apt-get install dnsutils -y && apt-get install curl -y && apt-get install netcat-traditional -y
  
# After the packages are installed, test the connectivity to the service:
curl -Iv http://<service-ip-address>:<port>

# After the packages are installed, test the connectivity to the application pod using netcat/nc command:
nc -z -v <pod-ip-address> <port>

If the previous command doesn't return an appropriate response, check the service events for any errors.

Access the LoadBalancer service

For the LoadBalancer service, you can access the load balancer IP address from outside the cluster.

curl -Iv http://<service-ip-address>:<port>

nc -z -v <pod-ip-address> <port>

Does the LoadBalancer service IP address return a correct response? If it doesn't, follow these steps:

Verify the events of the service.
Verify that the network security groups (NSGs) that are associated with the AKS nodes and AKS subnet allow the incoming traffic on the service port.

For more commands to troubleshoot services, see Debug services.

Scenarios that use an ingress instead of a service

For scenarios in which the application is exposed by using an Ingress resource, the traffic flow resembles the following progression:

Client >> DNS name >> Load balancer or application gateway IP address >> Ingress controller pods inside the cluster >> Service or pods

You can apply the inside-out approach of troubleshooting here, too. You can also check the ingress kubernetes resource and ingress controller details for more information:

$ kubectl get ing -n <namespace-of-ingress>  # Checking the ingress details and events.
NAME                         CLASS    HOSTS                ADDRESS       PORTS     AGE
hello-world-ingress          <none>   myapp.com            20.84.x.x     80, 443   7d22h

$ kubectl describe ing -n <namespace-of-ingress> hello-world-ingress
Name:             hello-world-ingress
Namespace:        <namespace-of-ingress>
Address:          20.84.x.x
Default backend:  default-http-backend:80 (<error: endpoints "default-http-backend" not found>)
TLS:
  tls-secret terminates myapp.com
Rules:
  Host                Path  Backends
  ----                ----  --------
  myapp.com
                      /blog   blog-service:80 (10.244.0.35:80)
                      /store  store-service:80 (10.244.0.33:80)

Annotations:          cert-manager.io/cluster-issuer: letsencrypt
                      kubernetes.io/ingress.class: nginx
                      nginx.ingress.kubernetes.io/rewrite-target: /$1
                      nginx.ingress.kubernetes.io/use-regex: true
Events:
  Type    Reason  Age    From                      Message
  ----    ------  ----   ----                      -------
  Normal  Sync    5m41s  nginx-ingress-controller  Scheduled for sync
  Normal  Sync    5m41s  nginx-ingress-controller  Scheduled for sync

This example contains an Ingress resource that:

Listens on the myapp.com host.
Has two Path strings configured.
Routes to two Services in the back end.

Verify that the back-end services are running, and respond to the port that's mentioned in the ingress description:

$ kubectl get svc -n <namespace-of-ingress>
NAMESPACE       NAME                                     TYPE           CLUSTER-IP     EXTERNAL-IP   PORT(S)                      
ingress-basic   blog-service                             ClusterIP      10.0.155.154   <none>        80/TCP                       
ingress-basic   store-service                            ClusterIP      10.0.61.185    <none>        80/TCP             
ingress-basic   nginx-ingress-ingress-nginx-controller   LoadBalancer   10.0.122.148   20.84.x.x     80:30217/TCP,443:32464/TCP

Verify the logs for the ingress controller pods if there's an error:

$ kubectl get pods -n <namespace-of-ingress>  # Get the ingress controller pods.
NAME                                                     READY   STATUS    RESTARTS   AGE
aks-helloworld-one-56c7b8d79d-6zktl                      1/1     Running   0          31h
aks-helloworld-two-58bbb47f58-rrcv7                      1/1     Running   0          31h
nginx-ingress-ingress-nginx-controller-9d8d5c57d-9vn8q   1/1     Running   0          31h
nginx-ingress-ingress-nginx-controller-9d8d5c57d-grzdr   1/1     Running   0          31h

$ # Check logs from the pods.
$ kubectl logs -n ingress-basic nginx-ingress-ingress-nginx-controller-9d8d5c57d-9vn8q

What if the client is making requests to the ingress host name or IP address, but no entries are seen in the logs of the ingress controller pod? In this case, the requests might not be reaching the cluster, and the user might be receiving a Connection Timed Out error message.

Another possibility is that the components on top of the ingress pods, such as Load Balancer or Application Gateway, aren't routing the requests to the cluster correctly. If this is true, you can check the back-end configuration of these resources.

If you receive a Connection Timed Out error message, check the network security group that's associated with the AKS nodes. Also, check the AKS subnet. It could be blocking the traffic from the load balancer or application gateway to the AKS nodes.

For more information about how to troubleshoot ingress (such as Nginx Ingress), see ingress-nginx troubleshooting.

Contact us for help

If you have questions or need help, create a support request, or ask Azure community support. You can also submit product feedback to Azure feedback community.

Third-party contact disclaimer

Microsoft provides third-party contact information to help you find additional information about this topic. This contact information may change without notice. Microsoft does not guarantee the accuracy of third-party contact information.

Поділитися через