Intermittent timeouts or server issues when accessing the application on AKS

04/12/2024

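This article describes how to troubleshoot intermittent connectivity issues that affect applications hosted on an Azure Kubernetes Service (AKS) cluster.

Prerequisites

The Client URL (cURL) tool, or a similar command-line tool.
The Kubernetes kubectl tool, or a similar tool to connect to the cluster. To install kubectl by using Azure CLI, run the az aks install-cli command.

Symptoms

When you run a cURL command, you occasionally receive a "Timed out" error message. The output might resemble the following text: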

$ # One connection is successful, which results in a HTTP 200 response.
$ curl -Iv http://20.62.x.x
*   Trying 20.62.x.x:80...
* Connected to 20.62.x.x (20.62.x.x) port 80 (#0)
...
...
< HTTP/1.1 200 OK
HTTP/1.1 200 OK

$ # Another connection is unsuccessful, because it gets timed out.
$ curl -Iv http://20.62.x.x
*   Trying 20.62.x.x:80...
* connect to 20.62.x.x port 80 failed: Timed out
* Failed to connect to 20.62.x.x port 80 after 21050 ms: Timed out
* Closing connection 0
curl: (28) Failed to connect to 20.62.x.x port 80 after 21050 ms: Timed out

$ # Then the next connection is again successful.
$ curl -Iv http://20.62.x.x
*   Trying 20.62.x.x:80...
* Connected to 20.62.x.x (20.62.x.x) port 80 (#0)
...
...
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
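
Because the failures are intermittent, a single request can easily miss them. The following is a minimal sketch, not part of the original procedure, that repeats a command and counts how many attempts fail. The function name count_failures and the attempt count are illustrative assumptions.

```shell
# count_failures N CMD...: run CMD N times and print how many runs failed.
# Hypothetical helper; it is not part of curl or kubectl.
count_failures() {
  attempts=$1
  shift
  failures=0
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Discard the command's own output; only its exit status matters here.
    if ! "$@" >/dev/null 2>&1; then
      failures=$((failures + 1))
    fi
    i=$((i + 1))
  done
  echo "$failures"
}

# Example usage against the service address from the output above
# (substitute the real address for 20.62.x.x):
#   count_failures 20 curl -sS -I --connect-timeout 5 http://20.62.x.x
```

A nonzero result that is lower than the number of attempts matches the intermittent pattern shown above. A result equal to the number of attempts points to a persistent failure instead.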

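Cause

Intermittent timeouts suggest a component performance issue rather than a networking problem.

In this scenario, it's important to check the usage and health of the components. You can check the status of the pods and nodes with built-in kubectl commands. Run the kubectl top and kubectl get commands, as follows: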

$ kubectl top pods  # Check the health of the pods and the nodes.
NAME                            CPU(cores)   MEMORY(bytes)
my-deployment-fc94b7f98-m9z2l   1m           32Mi

$ kubectl top nodes
NAME                                CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
aks-agentpool-42617579-vmss000000   120m         6%     2277Mi          49%

$ kubectl get pods  # Check the state of the pod.
NAME                            READY   STATUS    RESTARTS   AGE
my-deployment-fc94b7f98-m9z2l   2/2     Running   1          108s

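The output shows that the current usage of the pods and nodes appears to be acceptable.

Although the pod is in the Running state, it has already restarted once within its first 108 seconds. This might indicate that some issue affects the pod or the containers that run in it.

If the issue persists, the status of the pod changes after some time: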

$ kubectl get pods
NAME                            READY   STATUS             RESTARTS   AGE
my-deployment-fc94b7f98-m9z2l   1/2     CrashLoopBackOff   42         3h53m

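This example shows that the Ready state has changed (only one of the two containers is ready) and that the pod has restarted 42 times. One of the containers is in the CrashLoopBackOff state.

This situation occurs because the container fails after it starts, and Kubernetes then restarts the container to try to get it working again. However, if the underlying problem persists, the application keeps failing after it runs for some time. Kubernetes eventually reports the status as CrashLoopBackOff.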
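
In a namespace that has many pods, scanning the kubectl get pods output by eye is error-prone. The following sketch filters that output for pods that aren't in the Running state or that have restarted more than a given number of times. It assumes the default column order shown above (NAME, READY, STATUS, RESTARTS, AGE). The function name flag_unhealthy_pods and the threshold are illustrative.

```shell
# flag_unhealthy_pods THRESHOLD: read `kubectl get pods` output on stdin and
# print the name of each pod whose STATUS is not "Running" or whose RESTARTS
# count is greater than THRESHOLD. Hypothetical helper, not a kubectl feature.
flag_unhealthy_pods() {
  # NR > 1 skips the header row; "$4 + 0" forces a numeric comparison.
  awk -v max="$1" 'NR > 1 && ($3 != "Running" || $4 + 0 > max) { print $1 }'
}

# Example usage:
#   kubectl get pods | flag_unhealthy_pods 3
```

Pods that finished normally (for example, in the Completed state) are also printed by this filter, so treat the result as a list to investigate, not as a list of confirmed failures.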

To check the logs for the pod, run the following kubectl logs commands:

$ kubectl logs my-deployment-fc94b7f98-m9z2l
error: a container name must be specified for pod my-deployment-fc94b7f98-m9z2l, choose one of: [webserver my-app]

$ # Since the pod has more than one container, the name of the container has to be specified.
$ kubectl logs my-deployment-fc94b7f98-m9z2l -c webserver
[...] [mpm_event:notice] [pid 1:tid 140342576676160] AH00489: Apache/2.4.52 (Unix) configured -- resuming normal operations
[...] [core:notice] [pid 1:tid 140342576676160] AH00094: Command line: 'httpd -D FOREGROUND'
10.244.0.1 - - ... "GET / HTTP/1.1" 200 45
10.244.0.1 - - ... "GET /favicon.ico HTTP/1.1" 404 196
10.244.0.1 - - ... "-" 408 -
10.244.0.1 - - ... "HEAD / HTTP/1.1" 200 -
10.244.0.1 - - ... "HEAD / HTTP/1.1" 200 -
10.244.0.1 - - ... "HEAD / HTTP/1.1" 200 -
10.244.0.1 - - ... "HEAD / HTTP/1.1" 200 -
10.244.0.1 - - ... "HEAD / HTTP/1.1" 200 -
10.244.0.1 - - ... "HEAD / HTTP/1.1" 200 -
10.244.0.1 - - ... "HEAD / HTTP/1.1" 200 -
10.244.0.1 - - ... "HEAD / HTTP/1.1" 200 -
10.244.0.1 - - ... "HEAD / HTTP/1.1" 200 -
10.244.0.1 - - ... "POST /boaform/admin/formLogin HTTP/1.1" 404 196

$ # The webserver container is running fine. Check the logs for other container (my-app).
$ kubectl logs my-deployment-fc94b7f98-m9z2l -c my-app

$ # No logs observed. The container could be starting or be in a transition phase.
$ # So logs for the previous execution of this container can be checked using the --previous flag:
$ kubectl logs my-deployment-fc94b7f98-m9z2l -c my-app --previous
<Some Logs from the container>
..
..
Started increasing memory

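Log entries were written during the previous run of the container. The existence of these entries suggests that the application did start, but that it stopped because of some issue.

The next step is to check the events of the pod by running the kubectl describe command: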

$ kubectl describe pod my-deployment-fc94b7f98-m9z2l
Name:         my-deployment-fc94b7f98-m9z2l
Namespace:    default
...
...
Labels:       app=my-pod
...
...
Containers:
  webserver:
 ...
 ...
  my-app:
    Container ID:   containerd://a46e5062d53039d0d812c57c76b740f8d1ffb222de35203575bf8e4d10d6b51e
    Image:          my-repo/my-image:latest
    Image ID:       docker.io/my-repo/my-image@sha256:edcc4bedc7b...
    State:          Running
      Started:      <Start Date>
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
    Ready:          True
    Restart Count:  44
    Limits:
      memory:  500Mi
    Requests:
      cpu:        250m
      memory:     500Mi
...
...
Events:
  Type     Reason   Age                     From     Message
  ----     ------   ----                    ----     -------
  Normal   Pulling  49m (x37 over 4h4m)     kubelet  Pulling image "my-repo/my-image:latest"
  Warning  BackOff  4m10s (x902 over 4h2m)  kubelet  Back-off restarting failed container

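Observations:

The exit code is 137. For more information about exit codes, see the Docker run reference and Exit codes with special meanings.
The termination reason is OOMKilled.
The memory limit specified for the container is 500Mi.

From this output, you can tell that the container is being killed because it exceeds its memory limit. When the container reaches the memory limit, the application becomes intermittently inaccessible, and the container is killed and restarted.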
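
Exit code 137 follows the common convention that an exit code greater than 128 means the process was ended by a signal whose number is the code minus 128. Here, 137 - 128 = 9, which is SIGKILL on Linux and is consistent with the OOMKilled reason. The following sketch shows that arithmetic. The function name signal_from_exit_code is an illustrative assumption.

```shell
# signal_from_exit_code CODE: print the signal number encoded in an exit code,
# or "none" when the code is 128 or lower (the process exited on its own).
# Hypothetical helper that applies the 128 + signal-number convention.
signal_from_exit_code() {
  if [ "$1" -gt 128 ]; then
    echo $(( $1 - 128 ))
  else
    echo "none"
  fi
}

signal_from_exit_code 137   # prints 9, the number of SIGKILL on Linux

# To read the termination reason directly instead of scanning the describe
# output (pod and container names are taken from the example above):
#   kubectl get pod my-deployment-fc94b7f98-m9z2l \
#     -o jsonpath='{.status.containerStatuses[?(@.name=="my-app")].lastState.terminated.reason}'
```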

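Solution

You can remove the memory limit and monitor the application to determine how much memory it actually needs. After you learn the memory usage, you can update the memory limit on the container. If the memory usage continues to increase, determine whether there's a memory leak in the application.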
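
While you monitor the application, it can help to express each memory reading as a percentage of the limit that you plan to set. The following is a rough sketch, assuming readings with a Ki, Mi, or Gi suffix such as those printed by kubectl top. The function names to_mib and percent_of_limit are illustrative, and the deployment name my-deployment in the comments is inferred from the pod name, so verify it in your cluster.

```shell
# to_mib QUANTITY: convert a memory quantity that has a Ki, Mi, or Gi suffix
# to whole MiB. Other suffixes are rejected. Hypothetical helper.
to_mib() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *Ki) echo $(( ${1%Ki} / 1024 )) ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

# percent_of_limit USED LIMIT: print usage as an integer percentage of the limit.
percent_of_limit() {
  echo $(( $(to_mib "$1") * 100 / $(to_mib "$2") ))
}

# The pod-level reading from the earlier kubectl top output against the 500Mi limit:
percent_of_limit 32Mi 500Mi   # prints 6

# To sample per-container usage while the application runs:
#   kubectl top pod my-deployment-fc94b7f98-m9z2l --containers
# To apply a new limit after you know the real requirement
# (replace <new-limit> with the value that you determined):
#   kubectl set resources deployment my-deployment -c my-app --limits=memory=<new-limit>
```

A percentage that keeps climbing across samples, instead of leveling off, is the pattern to investigate as a possible memory leak.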

For more information about how to plan resources for workloads in Azure Kubernetes Service, see resource management best practices.

Contact us for help

If you have questions or need help, create a support request, or ask Azure community support. You can also submit product feedback to the Azure feedback community.
