Intermittent time-outs or server issues when accessing an application on AKS

Article
04/12/2024

This article describes how to troubleshoot intermittent connectivity issues that affect applications hosted on an Azure Kubernetes Service (AKS) cluster.

Prerequisites

The Client URL (cURL) tool, or a similar command-line tool.
The Kubernetes kubectl tool, or a similar tool for connecting to the cluster. To install kubectl by using Azure CLI, run the az aks install-cli command.

Symptoms

When you run a cURL command, you occasionally receive a "Timed out" error message. The output might resemble the following text:

$ # One connection is successful, which results in a HTTP 200 response.
$ curl -Iv http://20.62.x.x
*   Trying 20.62.x.x:80...
* Connected to 20.62.x.x (20.62.x.x) port 80 (#0)
...
...
< HTTP/1.1 200 OK
HTTP/1.1 200 OK

$ # Another connection is unsuccessful, because it gets timed out.
$ curl -Iv http://20.62.x.x
*   Trying 20.62.x.x:80...
* connect to 20.62.x.x port 80 failed: Timed out
* Failed to connect to 20.62.x.x port 80 after 21050 ms: Timed out
* Closing connection 0
curl: (28) Failed to connect to 20.62.x.x port 80 after 21050 ms: Timed out

$ # Then the next connection is again successful.
$ curl -Iv http://20.62.x.x
*   Trying 20.62.x.x:80...
* Connected to 20.62.x.x (20.62.x.x) port 80 (#0)
...
...
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
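
Because the failure is intermittent, a single request can easily miss it. The following is a minimal sketch, assuming a POSIX-compatible shell: the hypothetical helper count_failures runs any command a given number of times and reports how many runs failed. The curl options and the 5-second time-out in the usage comment are illustrative choices, not values taken from this article.

```shell
# count_failures N CMD [ARGS...]
# Runs CMD N times and prints "<failed>/<total> attempts failed".
# Hypothetical helper for illustration; it is not part of curl or kubectl.
# The function body is wrapped in ( ) so its variables stay local to a subshell.
count_failures() (
  total=$1
  shift
  failed=0
  i=0
  while [ "$i" -lt "$total" ]; do
    # Discard the command's own output; only its exit status matters here.
    if ! "$@" >/dev/null 2>&1; then
      failed=$((failed + 1))
    fi
    i=$((i + 1))
  done
  echo "${failed}/${total} attempts failed"
)

# Example usage (replace 20.62.x.x with the external IP of the service):
#   count_failures 20 curl -sS -o /dev/null --max-time 5 http://20.62.x.x
```

A non-zero failure count over a few dozen attempts confirms that the symptom is reproducible before you start to inspect the cluster.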

Cause

Intermittent time-outs usually point to a component performance issue rather than a networking problem.

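In this scenario, it's important to check the usage and health of the components. You can use the inside-out technique to check the status of the pods. Run the kubectl top and kubectl get commands, as follows: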

$ kubectl top pods  # Check the health of the pods and the nodes.
NAME                            CPU(cores)   MEMORY(bytes)
my-deployment-fc94b7f98-m9z2l   1m           32Mi

$ kubectl top nodes
NAME                                CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
aks-agentpool-42617579-vmss000000   120m         6%     2277Mi          49%

$ kubectl get pods  # Check the state of the pod.
NAME                            READY   STATUS    RESTARTS   AGE
my-deployment-fc94b7f98-m9z2l   2/2     Running   1          108s

The output shows that the current usage of the pods and nodes appears to be acceptable.

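Although the pod is in the Running state, it has already restarted once within the first 108 seconds of running. This restart might indicate that some issue affects the pod or the containers that run in it.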
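
On a cluster that runs many pods, a non-zero restart count is easy to overlook. The following sketch filters the kubectl get pods output for pods that have restarted. The helper name pods_with_restarts is hypothetical, and the filter assumes the default column layout (NAME, READY, STATUS, RESTARTS, AGE) shown above.

```shell
# pods_with_restarts
# Reads `kubectl get pods` output on stdin and prints "<pod> <restarts>"
# for every pod whose RESTARTS column (assumed to be column 4) is above zero.
# Hypothetical helper for illustration.
pods_with_restarts() {
  awk 'NR > 1 && $4 + 0 > 0 { print $1, $4 }'
}

# Example usage:
#   kubectl get pods | pods_with_restarts
```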

If the issue persists, the status of the pod changes after some time:

$ kubectl get pods
NAME                            READY   STATUS             RESTARTS   AGE
my-deployment-fc94b7f98-m9z2l   1/2     CrashLoopBackOff   42         3h53m

This example shows that the Ready state has changed and that the pod has restarted many times. One of the containers is in the CrashLoopBackOff state.

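This situation occurs because the container fails after it starts, and Kubernetes then restarts the container to try to get it working again. However, if the underlying issue persists, the application keeps failing after it runs for some time. Kubernetes eventually changes the status to CrashLoopBackOff.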
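
The Kubernetes pod lifecycle documentation describes the restart delay as an exponential back-off (10 seconds, 20 seconds, 40 seconds, and so on) that is capped at five minutes. The following sketch reproduces that schedule to show why a crash-looping container can stay unavailable for minutes at a time. The helper name backoff_delay is hypothetical, and the values are the documented kubelet defaults, which this article doesn't state and which might differ on your cluster.

```shell
# backoff_delay N
# Prints the assumed delay, in seconds, before restart attempt N+1:
# 10 seconds doubled N times, capped at 300 seconds.
# Hypothetical helper that mirrors the documented default schedule.
backoff_delay() (
  restarts=$1
  delay=10
  i=0
  while [ "$i" -lt "$restarts" ] && [ "$delay" -lt 300 ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  # Apply the five-minute cap.
  if [ "$delay" -gt 300 ]; then
    delay=300
  fi
  echo "$delay"
)

# Example usage:
#   backoff_delay 0    # first restart
#   backoff_delay 42   # after many restarts, the cap applies
```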

To check the logs of the pod, run the following kubectl logs commands:

$ kubectl logs my-deployment-fc94b7f98-m9z2l
error: a container name must be specified for pod my-deployment-fc94b7f98-m9z2l, choose one of: [webserver my-app]

$ # Since the pod has more than one container, the name of the container has to be specified.
$ kubectl logs my-deployment-fc94b7f98-m9z2l -c webserver
[...] [mpm_event:notice] [pid 1:tid 140342576676160] AH00489: Apache/2.4.52 (Unix) configured -- resuming normal operations
[...] [core:notice] [pid 1:tid 140342576676160] AH00094: Command line: 'httpd -D FOREGROUND'
10.244.0.1 - - ... "GET / HTTP/1.1" 200 45
10.244.0.1 - - ... "GET /favicon.ico HTTP/1.1" 404 196
10.244.0.1 - - ... "-" 408 -
10.244.0.1 - - ... "HEAD / HTTP/1.1" 200 -
10.244.0.1 - - ... "HEAD / HTTP/1.1" 200 -
10.244.0.1 - - ... "HEAD / HTTP/1.1" 200 -
10.244.0.1 - - ... "HEAD / HTTP/1.1" 200 -
10.244.0.1 - - ... "HEAD / HTTP/1.1" 200 -
10.244.0.1 - - ... "HEAD / HTTP/1.1" 200 -
10.244.0.1 - - ... "HEAD / HTTP/1.1" 200 -
10.244.0.1 - - ... "HEAD / HTTP/1.1" 200 -
10.244.0.1 - - ... "HEAD / HTTP/1.1" 200 -
10.244.0.1 - - ... "POST /boaform/admin/formLogin HTTP/1.1" 404 196

$ # The webserver container is running fine. Check the logs for other container (my-app).
$ kubectl logs my-deployment-fc94b7f98-m9z2l -c my-app

$ # No logs observed. The container could be starting or be in a transition phase.
$ # So logs for the previous execution of this container can be checked using the --previous flag:
$ kubectl logs my-deployment-fc94b7f98-m9z2l -c my-app --previous
<Some Logs from the container>
..
..
Started increasing memory

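These log entries were written during the previous run of the container. Their existence suggests that the application did start, but that it shut down because of some issue.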
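
The per-container log checks above can be combined into one pass. The following sketch, which assumes that kubectl is already connected to the cluster, prints the current logs and the logs of the previous instance for every container in a pod. The helper name dump_container_logs is hypothetical.

```shell
# dump_container_logs POD
# Prints the current logs of every container in POD and, when a previous
# instance exists, the logs of that instance as well.
# Hypothetical helper for illustration.
dump_container_logs() (
  pod=$1
  # List the container names declared in the pod spec.
  for c in $(kubectl get pod "$pod" -o jsonpath='{.spec.containers[*].name}'); do
    echo "=== $c (current) ==="
    kubectl logs "$pod" -c "$c"
    echo "=== $c (previous) ==="
    # --previous fails when the container has never restarted;
    # report that instead of stopping.
    kubectl logs "$pod" -c "$c" --previous 2>/dev/null || echo "(no previous instance)"
  done
)

# Example usage:
#   dump_container_logs my-deployment-fc94b7f98-m9z2l
```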

The next step is to check the events of the pod by running the kubectl describe command:

$ kubectl describe pod my-deployment-fc94b7f98-m9z2l
Name:         my-deployment-fc94b7f98-m9z2l
Namespace:    default
...
...
Labels:       app=my-pod
...
...
Containers:
  webserver:
 ...
 ...
  my-app:
    Container ID:   containerd://a46e5062d53039d0d812c57c76b740f8d1ffb222de35203575bf8e4d10d6b51e
    Image:          my-repo/my-image:latest
    Image ID:       docker.io/my-repo/my-image@sha256:edcc4bedc7b...
    State:          Running
      Started:      <Start Date>
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
    Ready:          True
    Restart Count:  44
    Limits:
      memory:  500Mi
    Requests:
      cpu:        250m
      memory:     500Mi
...
...
Events:
  Type     Reason   Age                     From     Message
  ----     ------   ----                    ----     -------
  Normal   Pulling  49m (x37 over 4h4m)     kubelet  Pulling image "my-repo/my-image:latest"
  Warning  BackOff  4m10s (x902 over 4h2m)  kubelet  Back-off restarting failed container

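Observations:

The exit code is 137. For more information about exit codes, see the Docker run reference and Exit codes with special meanings.
The termination reason is OOMKilled.
The memory limit specified for the container is 500 Mi.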

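You can tell from the container's last state and the events that the container is being killed because it exceeds its memory limit. When the container reaches the memory limit, the application becomes intermittently inaccessible, and the container is killed and restarted.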
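
By shell convention, an exit code above 128 means that the process was terminated by a signal whose number is the exit code minus 128. For 137, that's signal 9 (SIGKILL), which is consistent with the OOMKilled reason. The following sketch shows the arithmetic. The helper name describe_exit_code is hypothetical, and the jsonpath query in the usage comment is an assumption about how you might read the same field without kubectl describe.

```shell
# describe_exit_code CODE
# Interprets a container exit code by using the 128 + signal convention.
# Hypothetical helper for illustration.
describe_exit_code() (
  code=$1
  if [ "$code" -gt 128 ]; then
    echo "terminated by signal $((code - 128))"
  else
    echo "exited with status ${code}"
  fi
)

# Example usage:
#   describe_exit_code 137
#
# The termination reason can also be queried directly (assumed query;
# adjust the container name as needed):
#   kubectl get pod my-deployment-fc94b7f98-m9z2l \
#     -o jsonpath='{.status.containerStatuses[?(@.name=="my-app")].lastState.terminated.reason}'
```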

Solution

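You can remove the memory limit and monitor the application to determine how much memory it actually needs. After you learn the memory usage, you can update the memory limit on the container. If the memory usage continues to increase, determine whether there's a memory leak in the application.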
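
One way to monitor the memory usage is to sample kubectl top for the affected container. The following sketch extracts the memory value of one container from the kubectl top pod --containers output. The helper name container_memory_mi is hypothetical, the deployment name my-deployment is inferred from the pod name, the 60-second interval is arbitrary, and the filter assumes that the MEMORY(bytes) column is reported in Mi, as in the earlier output.

```shell
# container_memory_mi CONTAINER
# Reads `kubectl top pod <pod> --containers` output on stdin
# (columns: POD NAME CPU(cores) MEMORY(bytes)) and prints the memory usage
# of CONTAINER as a number of Mi.
# Hypothetical helper for illustration.
container_memory_mi() {
  awk -v c="$1" 'NR > 1 && $2 == c { sub(/Mi$/, "", $4); print $4 }'
}

# Example usage: sample once a minute and watch whether the value keeps growing.
#   while true; do
#     kubectl top pod my-deployment-fc94b7f98-m9z2l --containers | container_memory_mi my-app
#     sleep 60
#   done
#
# After a suitable value is known, the limit could be set again, for example:
#   kubectl set resources deployment my-deployment -c my-app --limits=memory=<new limit>
```

A value that climbs steadily across samples without leveling off is the pattern that would suggest a memory leak rather than an undersized limit.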

For more information about how to plan resources for workloads in Azure Kubernetes Service, see resource management best practices.

Contact us for help

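If you have questions or need help, create a support request, or ask Azure community support. You can also submit product feedback to the Azure feedback community.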
