Troubleshoot Azure Machine Learning extension

Article
09/04/2024

In this article, you learn how to troubleshoot common issues you might encounter when deploying the Azure Machine Learning extension in AKS or Arc-enabled Kubernetes.

How the Azure Machine Learning extension is installed

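The Azure Machine Learning extension is released as a Helm chart and installed by Helm V3. All components of the Azure Machine Learning extension are installed in the azureml namespace. You can use the following commands to check the extension status.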

# get the extension status
az k8s-extension show --name <extension-name>

# check status of all pods of Azure Machine Learning extension
kubectl get pod -n azureml

# get events of the extension
kubectl get events -n azureml --sort-by='.lastTimestamp'

Troubleshoot Azure Machine Learning extension deployment errors

Error: can't reuse a name that is still in use

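This error means the extension name you specified already exists. If the name is used by the Azure Machine Learning extension, wait about an hour and try again. If the name is used by another Helm chart, use a different name. Run helm list -Aa to list all Helm charts in your cluster.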
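As a rough sketch of that check, the helper below reports whether a given name appears among the Helm release names in a cluster. The function name `release_name_in_use` is hypothetical (it isn't part of Helm or the extension); it reads names from standard input so that it can be fed from `helm list -Aa -q`, where `-q` prints release names only.

```shell
# Hypothetical helper: succeeds (exit 0) if the name in $1 appears,
# as an exact whole-line match, among the release names read from stdin.
release_name_in_use() {
  grep -Fxq -- "$1"
}

# Example usage against a live cluster (requires helm and cluster access):
#   helm list -Aa -q | release_name_in_use "<extension-name>" && echo "name is taken"
```

Matching whole lines avoids false positives when one release name is a prefix of another.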

Error: earlier operation for the Helm chart is still in progress

Wait about an hour and try again after the unknown operation has completed.

Error: unable to create new content in namespace azureml because it is being terminated

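This error happens when an uninstallation operation hasn't finished and another installation operation is triggered. Run the az k8s-extension show command to check the provisioning status of the extension, and make sure the extension has been uninstalled before you take other actions.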
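In addition to checking the provisioning status, it can help to confirm that the namespace itself has finished terminating before retrying. The following is a minimal sketch, not an official procedure: the function names `namespace_phase` and `wait_for_namespace_removed`, the 5-second polling interval, and the 600-second default timeout are all assumptions chosen for illustration.

```shell
# Hypothetical helper: prints the phase of the given namespace
# ("Active" or "Terminating"); prints nothing if it no longer exists.
namespace_phase() {
  kubectl get namespace "$1" --ignore-not-found -o jsonpath='{.status.phase}'
}

# Hypothetical helper: polls until the namespace is gone or the timeout
# in seconds (default 600) elapses. Returns 0 once the namespace is gone.
wait_for_namespace_removed() {
  ns="$1"; timeout="${2:-600}"; waited=0
  while [ -n "$(namespace_phase "$ns")" ]; do
    [ "$waited" -ge "$timeout" ] && return 1
    sleep 5; waited=$((waited + 5))
  done
}

# Example usage against a live cluster:
#   wait_for_namespace_removed azureml && echo "azureml namespace removed; safe to reinstall"
```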

Error: failed in download the Chart path not found

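This error happens when you specify a wrong extension version. Make sure the specified version exists. If you want to use the latest version, you don't need to specify --version.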

Error: can't be imported into the current release: invalid ownership metadata

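This error means there's a conflict between existing cluster resources and the Azure Machine Learning extension. A full error message could look like the following text: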

CustomResourceDefinition "jobs.batch.volcano.sh" in namespace "" exists and cannot be imported into the current release: invalid ownership metadata; label validation error: missing key "app.kubernetes.io/managed-by": must be set to "Helm"; annotation validation error: missing key "meta.helm.sh/release-name": must be set to "amlarc-extension"; annotation validation error: missing key "meta.helm.sh/release-namespace": must be set to "azureml"

Use the following steps to mitigate the issue.

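Check who owns the problematic resource and whether the resource can be deleted or modified.
If the resource is used only by the Azure Machine Learning extension and can be deleted, you can manually add labels to mitigate the issue. Taking the previous error message as an example, you can run the following commands.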
```
kubectl label crd jobs.batch.volcano.sh "app.kubernetes.io/managed-by=Helm" 
kubectl annotate crd jobs.batch.volcano.sh "meta.helm.sh/release-namespace=azureml" "meta.helm.sh/release-name=<extension-name>"
```
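Setting the labels and annotations on the resource indicates that Helm manages the resource and that it is owned by the Azure Machine Learning extension.
If the resource is also used by other components in the cluster and can't be modified, refer to Deploy Azure Machine Learning extension to see whether there's a configuration setting to disable the conflicting resource.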
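To support the first step (checking who owns the resource), the sketch below prints the three Helm ownership keys named in the error message for a given CRD. The function name `show_helm_ownership` is hypothetical; the label and annotation keys are taken from the error text above, and an empty value means the key is missing.

```shell
# Hypothetical helper: prints the Helm ownership metadata currently set on a CRD.
# An empty value means the key is missing, which is what the error complains about.
show_helm_ownership() {
  crd="$1"
  printf 'managed-by: %s\n' \
    "$(kubectl get crd "$crd" -o jsonpath='{.metadata.labels.app\.kubernetes\.io/managed-by}')"
  printf 'release-name: %s\n' \
    "$(kubectl get crd "$crd" -o jsonpath='{.metadata.annotations.meta\.helm\.sh/release-name}')"
  printf 'release-namespace: %s\n' \
    "$(kubectl get crd "$crd" -o jsonpath='{.metadata.annotations.meta\.helm\.sh/release-namespace}')"
}

# Example usage against a live cluster:
#   show_helm_ownership jobs.batch.volcano.sh
```

If the values are already set but point at a different release name or namespace, another Helm release likely owns the resource, and relabeling it may break that release.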

HealthCheck of the extension

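When the installation fails and you didn't hit any of the above error messages, you can use the built-in health check job to run a comprehensive check on the extension. The Azure Machine Learning extension contains a HealthCheck job that pre-checks your cluster readiness when you try to install, update, or delete the extension. The HealthCheck job outputs a report, which is saved in a configmap named arcml-healthcheck in the azureml namespace. The error codes and possible solutions for the report are listed in Error code of HealthCheck.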

Run this command to get the HealthCheck report:

kubectl describe configmap -n azureml arcml-healthcheck

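The health check is triggered whenever you install, update, or delete the extension. The health check report is structured in several parts: pre-install, pre-rollback, pre-upgrade, and pre-delete.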

If the extension installation failed, look into pre-install and pre-delete.
If the extension update failed, look into pre-upgrade and pre-rollback.
If the extension deletion failed, look into pre-delete.
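The mapping above can be captured in a small lookup, sketched here for use in a triage script. The function name `healthcheck_sections` and the operation keywords `install`, `update`, and `delete` are hypothetical; only the section names come from the report structure described above.

```shell
# Hypothetical helper: maps a failed operation to the report sections to review.
healthcheck_sections() {
  case "$1" in
    install) echo "pre-install pre-delete" ;;
    update)  echo "pre-upgrade pre-rollback" ;;
    delete)  echo "pre-delete" ;;
    *)       echo "unknown operation: $1" >&2; return 1 ;;
  esac
}

# Example: list the sections to read after a failed update.
healthcheck_sections update
```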

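When you request support, we recommend that you run the following command and send the healthcheck.logs file to us, because it can help us better locate the problem.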

kubectl logs healthcheck -n azureml > healthcheck.logs

Error code of HealthCheck

The following table shows how to troubleshoot the error codes returned by the HealthCheck report.

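Error code	Error message	Description
E40001	LOAD_BALANCER_NOT_SUPPORT	Load balancer isn't supported in your cluster. You need to configure a load balancer in your cluster, or consider setting `inferenceRouterServiceType` to `nodePort` or `clusterIP`.
E40002	INSUFFICIENT_NODE	You have enabled `inferenceRouterHA`, which requires at least three nodes in your cluster. Disable HA if you have fewer than three nodes.
E40003	INTERNAL_LOAD_BALANCER_NOT_SUPPORT	Currently, only AKS supports the internal load balancer, and only the `azure` type is supported. Don't set `internalLoadBalancerProvider` if you don't have an AKS cluster.
E40007	INVALID_SSL_SETTING	The SSL key or certificate isn't valid. The CNAME should be compatible with the certificate.
E45002	PROMETHEUS_CONFLICT	The Prometheus Operator being installed conflicts with your existing Prometheus Operator. For more information, see the Prometheus Operator section.
E45003	BAD_NETWORK_CONNECTIVITY	You need to meet the network requirements.
E45004	AZUREML_FE_ROLE_CONFLICT	The Azure Machine Learning extension isn't supported in legacy AKS. To install the Azure Machine Learning extension, you need to delete the legacy azureml-fe components.
E45005	AZUREML_FE_DEPLOYMENT_CONFLICT	The Azure Machine Learning extension isn't supported in legacy AKS. To install the Azure Machine Learning extension, you need to run the commands below this table to delete the legacy azureml-fe components. For more information, see here.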

The commands to delete the legacy azureml-fe components in an AKS cluster are as follows:

kubectl delete sa azureml-fe
kubectl delete clusterrole azureml-fe-role
kubectl delete clusterrolebinding azureml-fe-binding
kubectl delete svc azureml-fe
kubectl delete svc azureml-fe-int-http
kubectl delete deploy azureml-fe
kubectl delete secret azuremlfessl
kubectl delete cm azuremlfeconfig
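For error E40002 in the table above, the node-count requirement can be checked before enabling `inferenceRouterHA`. This is a minimal sketch: the function name `enough_nodes_for_ha` is hypothetical, and the threshold of three comes from the table's description of that error.

```shell
# Hypothetical helper: succeeds if the node count in $1 meets the
# three-node minimum that inferenceRouterHA requires (error E40002).
enough_nodes_for_ha() {
  [ "$1" -ge 3 ]
}

# Example usage against a live cluster:
#   nodes=$(kubectl get nodes --no-headers | wc -l)
#   enough_nodes_for_ha "$nodes" || echo "Fewer than three nodes: disable inferenceRouterHA"
```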

Open source components integration

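The Azure Machine Learning extension uses some open source components, including Prometheus Operator, Volcano Scheduler, and DCGM exporter. If the Kubernetes cluster already has some of them installed, read the following sections to integrate your existing components with the Azure Machine Learning extension.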

Prometheus Operator

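Prometheus Operator is an open source framework that helps build metric monitoring systems in Kubernetes. The Azure Machine Learning extension also uses Prometheus Operator to help monitor the resource utilization of jobs.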

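If the Prometheus Operator was already installed in the cluster by another service, you can specify installPromOp=false to disable the Prometheus Operator in the Azure Machine Learning extension and avoid a conflict between two Prometheus Operators. In this case, the existing Prometheus Operator manages all Prometheus instances. To make sure Prometheus works properly, pay attention to the following things when you disable the Prometheus Operator in the Azure Machine Learning extension.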

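Check whether the prometheus in the azureml namespace is managed by the Prometheus Operator. In some scenarios, the Prometheus Operator is set to monitor only some specific namespaces. If so, make sure the azureml namespace is in the allowlist. For more information, see command flags.
Check whether kubelet-service is enabled in the Prometheus Operator. kubelet-service contains all the endpoints of kubelet. For more information, see command flags. Also make sure that kubelet-service has the label k8s-app=kubelet.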

Create a ServiceMonitor for kubelet-service. Run the following command with the variables replaced:

cat << EOF | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: prom-kubelet
  namespace: azureml
  labels:
    release: "<extension-name>"     # Replace with your Azure Machine Learning extension name
spec:
  endpoints:
  - port: https-metrics
    scheme: https
    path: /metrics/cadvisor
    honorLabels: true
    tlsConfig:
      caFile: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      insecureSkipVerify: true
    bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabelings:
    - sourceLabels:
      - __metrics_path__
      targetLabel: metrics_path
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
    - "<namespace-of-your-kubelet-service>"  # Change this to the namespace of your kubelet-service
  selector:
    matchLabels:
      k8s-app: kubelet    # Make sure your kubelet-service has a label named k8s-app whose value is kubelet

EOF
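To find the value for the namespace placeholder in the ServiceMonitor, and to confirm that the required label is present, a lookup along the following lines may help. It is a sketch only: the function name `find_kubelet_services` is hypothetical, and the label selector is the `k8s-app=kubelet` label mentioned in the checks above.

```shell
# Hypothetical helper: prints "namespace/name" for every Service labeled
# k8s-app=kubelet, searching across all namespaces.
find_kubelet_services() {
  kubectl get svc -A -l k8s-app=kubelet \
    -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}'
}

# Example usage against a live cluster:
#   find_kubelet_services
# An empty result suggests that kubelet-service is not enabled or lacks the label.
```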

DCGM exporter

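dcgm-exporter is the official tool recommended by NVIDIA for collecting GPU metrics. It's integrated into the Azure Machine Learning extension. However, dcgm-exporter isn't enabled by default, so no GPU metrics are collected. You can set the installDcgmExporter flag to true to enable it. Because it's NVIDIA's official tool, you might already have it installed in your GPU cluster. If so, you can set installDcgmExporter to false and follow these steps to integrate your dcgm-exporter into the Azure Machine Learning extension. Also note that dcgm-exporter lets users configure which metrics to expose. For the Azure Machine Learning extension, make sure the DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_FB_FREE, and DCGM_FI_DEV_FB_USED metrics are exposed.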

Make sure you have the Azure Machine Learning extension and dcgm-exporter installed successfully. dcgm-exporter can be installed by the dcgm-exporter Helm chart or the gpu-operator Helm chart.

Check whether there's a service for dcgm-exporter. If it doesn't exist or you don't know how to check, run the following command to create one.

cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter-service
  namespace: "<namespace-of-your-dcgm-exporter>" # Change this to the namespace of your dcgm-exporter
  labels:
    app.kubernetes.io/name: dcgm-exporter
    app.kubernetes.io/instance: "<extension-name>" # Replace with your Azure Machine Learning extension name
    app.kubernetes.io/component: "dcgm-exporter"
  annotations:
    prometheus.io/scrape: 'true'
spec:
  type: "ClusterIP"
  ports:
  - name: "metrics"
    port: 9400  # Replace with the correct port of your dcgm-exporter. It's 9400 by default
    targetPort: 9400  # Replace with the correct port of your dcgm-exporter. It's 9400 by default
    protocol: TCP
  selector:
    app.kubernetes.io/name: dcgm-exporter  # These two labels select the dcgm-exporter pods. Change them to match the actual labels on the pods
    app.kubernetes.io/instance: "<dcgm-exporter-helm-chart-name>" # Replace with the Helm chart name of dcgm-exporter
EOF

Check that the service from the previous step is set up correctly:

kubectl -n <namespace-of-your-dcgm-exporter> port-forward service/dcgm-exporter-service 9400:9400
# run this command in a separate terminal. You will get a lot of dcgm metrics with this command.
curl http://127.0.0.1:9400/metrics

Set up a ServiceMonitor to expose the dcgm-exporter service to the Azure Machine Learning extension. Run the following command; it takes effect within a few minutes.

cat << EOF | kubectl apply -f -
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dcgm-exporter-monitor
  namespace: azureml
  labels:
    app.kubernetes.io/name: dcgm-exporter
    release: "<extension-name>"   # Replace with your Azure Machine Learning extension name
    app.kubernetes.io/component: "dcgm-exporter"
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: dcgm-exporter
      app.kubernetes.io/instance: "<extension-name>"   # Replace with your Azure Machine Learning extension name
      app.kubernetes.io/component: "dcgm-exporter"
  namespaceSelector:
    matchNames:
    - "<namespace-of-your-dcgm-exporter>"  # Change this to the namespace of your dcgm-exporter
  endpoints:
  - port: "metrics"
    path: "/metrics"
EOF
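Once the dcgm-exporter service is reachable through the port-forward, the three metrics that the Azure Machine Learning extension needs can be checked automatically. The following is a minimal sketch: the function name `missing_dcgm_metrics` is hypothetical, and it assumes the usual Prometheus text format in which each sample line starts with the metric name followed by `{` or a space.

```shell
# Hypothetical helper: reads Prometheus-format metrics text from stdin and
# prints each required metric that is absent. Returns 1 if any are missing.
missing_dcgm_metrics() {
  metrics=$(cat)
  missing=0
  for name in DCGM_FI_DEV_GPU_UTIL DCGM_FI_DEV_FB_FREE DCGM_FI_DEV_FB_USED; do
    # A sample line starts with the metric name followed by "{" or a space.
    if ! printf '%s\n' "$metrics" | grep -q "^${name}[{ ]"; then
      echo "$name"
      missing=1
    fi
  done
  return "$missing"
}

# Example usage while the port-forward is running:
#   curl -s http://127.0.0.1:9400/metrics | missing_dcgm_metrics && echo "all required metrics exposed"
```

Any name that is printed needs to be added to the dcgm-exporter metrics configuration.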

Volcano Scheduler

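If your cluster already has the Volcano suite installed, you can set installVolcano=false so that the extension doesn't install the Volcano Scheduler. Volcano Scheduler and Volcano Controller are required for training job submission and scheduling.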

The Volcano Scheduler config used by the Azure Machine Learning extension is:

volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
        - name: task-topology
        - name: priority
        - name: gang
        - name: conformance
    - plugins:
        - name: overcommit
        - name: drf
        - name: predicates
        - name: proportion
        - name: nodeorder
        - name: binpack

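You need to use the same config settings as above, and if your Volcano version is lower than 1.6, you must disable the job/validate webhook in the Volcano admission so that Azure Machine Learning training workloads can run properly.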

Volcano Scheduler integration supporting cluster autoscaler

As discussed in this thread, the gang plugin doesn't work well with the cluster autoscaler (CA) or with the node autoscaler in AKS.

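If you use the Volcano that comes with the Azure Machine Learning extension by setting installVolcano=true, the extension has a default scheduler config that configures the gang plugin to prevent job deadlock. Therefore, the cluster autoscaler (CA) in an AKS cluster isn't supported with the Volcano installed by the extension.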

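In this case, if you want the AKS cluster autoscaler to work normally, you can update the extension to set the volcanoScheduler.schedulerConfigMap parameter and give it a custom Volcano Scheduler config without gang, for example: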

volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: sla
        arguments:
          sla-waiting-time: 1m
    - plugins:
      - name: conformance
    - plugins:
        - name: drf
        - name: predicates
        - name: proportion
        - name: nodeorder
        - name: binpack

To use this config in your AKS cluster, follow these steps:

Create a configmap file with the above config in the azureml namespace. This namespace is generally created when you install the Azure Machine Learning extension.

Set volcanoScheduler.schedulerConfigMap=<configmap name> in the extension config to apply this configmap. You also need to set amloperator.skipResourceValidation=true to skip the resource validation when installing the extension. For example:

az k8s-extension update --name <extension-name> --config volcanoScheduler.schedulerConfigMap=<configmap name> amloperator.skipResourceValidation=true --cluster-type managedClusters --cluster-name <your-AKS-cluster-name> --resource-group <your-RG-name>
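The first of these steps, creating the configmap, could be scripted roughly as follows. This is a sketch under assumptions: the function name `create_scheduler_configmap` is hypothetical, the data key `volcano-scheduler.conf` is inferred from the config snippet above, and the input file is expected to hold only the config body (the `actions:` and `tiers:` entries) without the `volcano-scheduler.conf: |` line.

```shell
# Hypothetical helper: creates (or updates) a ConfigMap in the azureml
# namespace whose volcano-scheduler.conf key holds the given file's content.
create_scheduler_configmap() {
  name="$1"; conf_file="$2"
  # Render the ConfigMap client-side, then apply it so reruns update in place.
  kubectl create configmap "$name" -n azureml \
    --from-file=volcano-scheduler.conf="$conf_file" \
    --dry-run=client -o yaml | kubectl apply -f -
}

# Example usage against a live cluster:
#   create_scheduler_configmap "<configmap name>" ./volcano-scheduler.conf
```

Rendering with `--dry-run=client` and piping into `kubectl apply` keeps the operation repeatable, whereas a plain `kubectl create` fails if the configmap already exists.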

Note

Because the gang plugin is removed, a deadlock might occur when Volcano schedules jobs.

To avoid this situation, you can use the same instance type across jobs.

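Using a scheduler configuration other than the default provided by the Azure Machine Learning extension isn't supported. Proceed with caution.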

Note that if your Volcano version is lower than 1.6, you must disable the job/validate webhook in the Volcano admission.

Ingress Nginx controller

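The Azure Machine Learning extension installation comes with an ingress nginx controller class of k8s.io/ingress-nginx by default. If you already have an ingress nginx controller in your cluster, you need to use a different controller class to avoid installation failure.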

You have two options:

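Change your existing controller class to something other than k8s.io/ingress-nginx.
Create or update the Azure Machine Learning extension with a custom controller class that is different from yours, following the examples below.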

For example, to create the extension with a custom controller class:

az k8s-extension create --config nginxIngress.controller="k8s.io/amlarc-ingress-nginx"

To update the extension with a custom controller class:

az k8s-extension update --config nginxIngress.controller="k8s.io/amlarc-ingress-nginx"
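Before choosing between the two options, it may help to check whether the default controller class is already claimed in the cluster. The sketch below assumes that an existing ingress nginx installation registers an IngressClass whose `.spec.controller` equals `k8s.io/ingress-nginx`; the function name `default_ingress_class_taken` is hypothetical.

```shell
# Hypothetical helper: reads ingress controller names from stdin (one per line)
# and succeeds if the extension's default controller class is already present.
default_ingress_class_taken() {
  grep -Fxq "k8s.io/ingress-nginx"
}

# Example usage against a live cluster:
#   kubectl get ingressclass -o jsonpath='{range .items[*]}{.spec.controller}{"\n"}{end}' \
#     | default_ingress_class_taken && echo "conflict: use a custom controller class"
```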

Nginx ingress controller installed with the Azure Machine Learning extension crashes due to out-of-memory (OOM) errors

Symptom

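The nginx ingress controller installed with the Azure Machine Learning extension crashes due to out-of-memory (OOM) errors even when there's no workload. The controller logs don't show any useful information to diagnose the problem.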

Possible cause

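This issue might occur if the nginx ingress controller runs on a node with many CPUs. By default, the nginx ingress controller spawns worker processes according to the number of CPUs, which consumes more resources and can cause OOM errors on nodes with more CPUs. This is a known issue reported on GitHub.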

Mitigation

To resolve this issue, you can:

Adjust the number of worker processes by installing the extension with the parameter nginxIngress.controllerConfig.worker-processes=8.
Increase the memory limit by using the parameter nginxIngress.resources.controller.limits.memory=<new limit>.

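Adjust these two parameters according to your specific node specifications and workload requirements to optimize your workloads effectively.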
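Because the controller logs are not informative in this situation, one way to confirm the symptom is to look at the last termination reason that Kubernetes recorded for each container. The following is a minimal sketch: the function name `oom_killed_pods` is hypothetical, and it expects tab-separated "pod name, termination reasons" lines such as those produced by the `kubectl` query in the usage comment.

```shell
# Hypothetical helper: reads "pod<TAB>reasons" lines from stdin and prints the
# names of pods that have a container last terminated with reason OOMKilled.
oom_killed_pods() {
  awk -F'\t' '$2 ~ /OOMKilled/ { print $1 }'
}

# Example usage against a live cluster:
#   kubectl get pods -n azureml \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' \
#     | oom_killed_pods
```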
