Deploy the Azure Machine Learning extension on an AKS or Arc Kubernetes cluster

Article
09/03/2024

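To enable your AKS or Arc Kubernetes cluster to run training jobs or inference workloads, you must first deploy the Azure Machine Learning extension on the cluster. The Azure Machine Learning extension is built on the cluster extension for AKS and the cluster extension for Arc Kubernetes, and you can manage its lifecycle with the Azure CLI k8s-extension commands.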

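In this article, you learn about:

Prerequisites
Limitations
Review Azure Machine Learning extension configuration settings
Azure Machine Learning extension deployment scenarios
Verify Azure Machine Learning extension deployment
Review Azure Machine Learning extension components
Manage Azure Machine Learning extension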

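Prerequisites

An AKS cluster running in Azure. If you haven't previously used cluster extensions, you need to register the KubernetesConfiguration service provider.
Or an Arc Kubernetes cluster that is up and running. Follow the instructions in Connect an existing Kubernetes cluster to Azure Arc.
- If the cluster is an Azure RedHat OpenShift (ARO) Service cluster or an OpenShift Container Platform (OCP) cluster, you must complete the additional prerequisite steps described in the Reference for configuring Kubernetes cluster article.
For production use, the Kubernetes cluster must have at least 4 vCPU cores and 14 GB of memory. For resource details and cluster size recommendations, see Recommended resource planning.
A cluster running behind an outbound proxy server or firewall needs extra network configuration.
Install the Azure CLI, or upgrade it to version 2.24.0 or later.
Install the Azure CLI extension k8s-extension, or upgrade it to version 1.2.3 or later.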
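
The tool-related prerequisites can be checked from a shell. The following is a minimal sketch, assuming the Azure CLI is already installed and signed in. The function names version_ge and check_prerequisites are hypothetical helpers introduced here, and the version comparison relies on GNU sort -V.

```shell
# Minimum versions taken from the prerequisites list.
MIN_CLI_VERSION="2.24.0"
MIN_EXT_VERSION="1.2.3"

# version_ge A B: succeeds when version A is greater than or equal to version B.
# Hypothetical helper; relies on GNU sort -V (version sort).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_prerequisites: registers the provider, installs or upgrades the CLI
# extension, and reports tools that are older than the required versions.
check_prerequisites() {
  local cli_version ext_version

  # Register the KubernetesConfiguration provider for the current subscription.
  az provider register --namespace Microsoft.KubernetesConfiguration

  # Install the k8s-extension CLI extension, or upgrade it if already installed.
  az extension add --name k8s-extension --upgrade

  cli_version="$(az version --query '"azure-cli"' --output tsv)"
  ext_version="$(az extension show --name k8s-extension --query version --output tsv)"

  version_ge "$cli_version" "$MIN_CLI_VERSION" \
    || echo "Upgrade the Azure CLI to $MIN_CLI_VERSION or later (found $cli_version)"
  version_ge "$ext_version" "$MIN_EXT_VERSION" \
    || echo "Upgrade k8s-extension to $MIN_EXT_VERSION or later (found $ext_version)"
}
```

Calling check_prerequisites runs the checks against the subscription that is currently selected in the Azure CLI.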

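Limitations

Azure Machine Learning doesn't support using a service principal with AKS. The AKS cluster must use a managed identity instead. Both system-assigned and user-assigned managed identities are supported. For more information, see Use a managed identity in Azure Kubernetes Service.
- When your AKS cluster is converted from a service principal to a managed identity, all node pools must be deleted and recreated, rather than updated in place, before you install the extension.
Azure Machine Learning doesn't support disabling local accounts for AKS. When the AKS cluster is deployed, local accounts are enabled by default.
If your AKS cluster has authorized IP ranges enabled for API server access, add the Azure Machine Learning control plane IP ranges to the AKS cluster. The Azure Machine Learning control plane is deployed across paired regions. Without access to the API server, the machine learning pods can't be deployed. When you enable the IP ranges in the AKS cluster, use the IP ranges for both paired regions.
Azure Machine Learning doesn't support attaching an AKS cluster across subscriptions. If your AKS cluster is in a different subscription, first connect it to Azure Arc, and then specify it in the same subscription as your Azure Machine Learning workspace.
Azure Machine Learning doesn't guarantee support for all preview features in AKS. For example, Microsoft Entra pod identity isn't supported.
If you followed the steps in the Azure Machine Learning AKS v1 document to create or attach your AKS cluster as an inference cluster, use the linked instructions to clean up the legacy azureml-fe related resources before you continue to the next step.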

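Review Azure Machine Learning extension configuration settings

You can deploy the Azure Machine Learning extension with the Azure CLI command k8s-extension create. The command lets you specify a set of configuration settings in key=value format through the --config or --config-protected parameter. The following tables list the configuration settings available during Azure Machine Learning extension deployment.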

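Configuration setting key name	Description	Training	Inference	Training and inference
`enableTraining`	`True` or `False`, default `False`. Must be set to `True` for an Azure Machine Learning extension deployment that supports machine learning model training and batch scoring.	✓	N/A	✓
`enableInference`	`True` or `False`, default `False`. Must be set to `True` for an Azure Machine Learning extension deployment that supports machine learning inference.	N/A	✓	✓
`allowInsecureConnections`	`True` or `False`, default `False`. Can be set to `True` to use inference HTTP endpoints for development or test purposes.	N/A	Optional	Optional
`inferenceRouterServiceType`	`loadBalancer`, `nodePort`, or `clusterIP`. Required if `enableInference=True`.	N/A	✓	✓
`internalLoadBalancerProvider`	This setting currently applies only to Azure Kubernetes Service (AKS) clusters. Set to `azure` to have the inference router use an internal load balancer.	N/A	Optional	Optional
`sslSecret`	The name of a Kubernetes secret in the `azureml` namespace. The secret stores `cert.pem` (PEM-encoded TLS/SSL certificate) and `key.pem` (PEM-encoded TLS/SSL key), which are required for inference HTTPS endpoint support when `allowInsecureConnections` is set to `False`. For a sample YAML definition of `sslSecret`, see Configure sslSecret. Use either this setting or the combination of the `sslCertPemFile` and `sslKeyPemFile` protected configuration settings.	N/A	Optional	Optional
`sslCname`	The TLS/SSL CNAME used by the inference HTTPS endpoint. Required if `allowInsecureConnections=False`.	N/A	Optional	Optional
`inferenceRouterHA`	`True` or `False`, default `True`. By default, the Azure Machine Learning extension deploys three inference router replicas for high availability, which requires at least three worker nodes in the cluster. Set to `False` if your cluster has fewer than three worker nodes; in that case, only one inference router service is deployed.	N/A	Optional	Optional
`nodeSelector`	By default, the deployed Kubernetes resources and your machine learning workloads are randomly placed on one or more nodes of the cluster, and DaemonSet resources are deployed to all nodes. To restrict the extension deployment and your training or inference workloads to specific nodes labeled `key1=value1` and `key2=value2`, use `nodeSelector.key1=value1` and `nodeSelector.key2=value2` correspondingly.	Optional	Optional	Optional
`installNvidiaDevicePlugin`	`True` or `False`, default `False`. The NVIDIA device plugin is required for ML workloads on NVIDIA GPU hardware. By default, the Azure Machine Learning extension deployment doesn't install the NVIDIA device plugin, regardless of whether the Kubernetes cluster has GPU hardware. You can set this setting to `True` to install the plugin, but make sure that you meet its prerequisites.	Optional	Optional	Optional
`installPromOp`	`True` or `False`, default `True`. The Azure Machine Learning extension needs the Prometheus operator to manage Prometheus. Set to `False` to reuse an existing Prometheus operator. For more information about reusing an existing Prometheus operator, see Reusing the Prometheus operator.	Optional	Optional	Optional
`installVolcano`	`True` or `False`, default `True`. The Azure Machine Learning extension needs the Volcano scheduler to schedule jobs. Set to `False` to reuse an existing Volcano scheduler. For more information about reusing an existing Volcano scheduler, see Reusing Volcano scheduler.	Optional	N/A	Optional
`installDcgmExporter`	`True` or `False`, default `False`. dcgm-exporter can expose GPU metrics for Azure Machine Learning workloads, which can be monitored in the Azure portal. Set `installDcgmExporter` to `True` to install dcgm-exporter. To use your own dcgm-exporter instead, see DCGM exporter.	Optional	Optional	Optional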
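
The nodeSelector row maps node labels to configuration keys one for one. The following is a minimal sketch of that mapping; node_selector_config is a hypothetical helper introduced for illustration, not part of the Azure CLI.

```shell
# node_selector_config LABEL...
# Turns node labels such as key1=value1 into the matching
# nodeSelector.key1=value1 settings, ready to pass to --config.
# Hypothetical helper introduced for illustration.
node_selector_config() {
  local out="" label
  for label in "$@"; do
    # Join the settings with single spaces.
    out="${out:+$out }nodeSelector.$label"
  done
  printf '%s\n' "$out"
}

# Example usage (angle-bracket placeholders are yours to fill in):
#   kubectl label nodes <node-name> key1=value1 key2=value2
#   node_selector_config key1=value1 key2=value2
```

The printed settings can then be appended to the other key=value pairs passed to --config.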

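Configuration protected setting key name	Description	Training	Inference	Training and inference
`sslCertPemFile`, `sslKeyPemFile`	Paths to the TLS/SSL certificate and key files (PEM-encoded). Required for an Azure Machine Learning extension deployment with inference HTTPS endpoint support when `allowInsecureConnections` is set to `False`. Note: PEM files protected by a passphrase aren't supported.	N/A	Optional	Optional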
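
As an alternative to passing the PEM file paths, the sslSecret setting points at a Kubernetes secret that holds cert.pem and key.pem. The following is a sketch of creating such a secret with kubectl, assuming the azureml namespace already exists in the cluster. The run and create_ssl_secret helpers are hypothetical names introduced here, and the command is only printed unless DRY_RUN=0 is set.

```shell
# run CMD...: prints the command by default; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# create_ssl_secret NAME CERT_PATH KEY_PATH
# Stores the certificate and key under the cert.pem and key.pem keys that the
# sslSecret setting expects, in the azureml namespace.
# Assumption: the azureml namespace already exists.
create_ssl_secret() {
  run kubectl create secret generic "$1" \
    --namespace azureml \
    --from-file=cert.pem="$2" \
    --from-file=key.pem="$3"
}
```

For example, create_ssl_secret my-tls-secret ./cert.pem ./key.pem prints the kubectl command for review. After the secret exists, sslSecret=my-tls-secret can be passed through --config together with sslCname.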

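As the configuration settings tables show, different combinations of configuration settings let you deploy the Azure Machine Learning extension for different ML workload scenarios:

For training jobs and batch inference workloads, specify enableTraining=True.
For inference workloads only, specify enableInference=True.
For all kinds of ML workloads, specify both enableTraining=True and enableInference=True.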

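If you plan to deploy the Azure Machine Learning extension for real-time inference workloads and want to specify enableInference=True, pay attention to the following configuration settings related to real-time inference:

Real-time inference support requires the azureml-fe router service, and you must specify the inferenceRouterServiceType configuration setting for azureml-fe. azureml-fe can be deployed with one of the following inferenceRouterServiceType values:
- Type LoadBalancer. Exposes azureml-fe externally using a cloud provider's load balancer. To specify this value, make sure that your cluster supports load balancer provisioning. Note that most on-premises Kubernetes clusters might not support an external load balancer.
- Type NodePort. Exposes azureml-fe on each node's IP at a static port. You can reach azureml-fe from outside the cluster by requesting <NodeIP>:<NodePort>. Using NodePort also lets you set up your own load-balancing solution and TLS/SSL termination for azureml-fe.
- Type ClusterIP. Exposes azureml-fe on a cluster-internal IP, which makes azureml-fe reachable only from within the cluster. For azureml-fe to serve inference requests that come from outside the cluster, you must set up your own load-balancing solution and TLS/SSL termination for azureml-fe.
To ensure high availability of the azureml-fe routing service, the Azure Machine Learning extension deployment by default creates three replicas of azureml-fe for clusters that have three or more nodes. If your cluster has fewer than 3 nodes, set inferenceRouterHA=False.
You should also consider using HTTPS to restrict access to model endpoints and secure the data that clients submit. For this purpose, specify either the sslSecret configuration setting or the combination of the sslKeyPemFile and sslCertPemFile protected configuration settings.
By default, the Azure Machine Learning extension deployment expects configuration settings for HTTPS support. For development or testing purposes, HTTP support is conveniently provided through the configuration setting allowInsecureConnections=True.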
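
The rules in this list can be checked locally before running a deployment. The following is a sketch of such a pre-flight check, written only from the rules stated in this section. validate_inference_config is a hypothetical helper, not a feature of the Azure CLI, and it doesn't contact the cluster.

```shell
# validate_inference_config NODE_COUNT SETTING...
# Checks key=value settings against the real-time inference rules and prints
# any problems found. Returns non-zero when at least one rule is violated.
validate_inference_config() {
  local nodes="$1"; shift
  # Defaults from the configuration settings table.
  local enable_inference="False" insecure="False" router_ha="True"
  local service_type="" ssl_secret="" ssl_cname="" cert="" key="" kv status=0
  for kv in "$@"; do
    case "$kv" in
      enableInference=*)            enable_inference="${kv#*=}" ;;
      allowInsecureConnections=*)   insecure="${kv#*=}" ;;
      inferenceRouterHA=*)          router_ha="${kv#*=}" ;;
      inferenceRouterServiceType=*) service_type="${kv#*=}" ;;
      sslSecret=*)                  ssl_secret="${kv#*=}" ;;
      sslCname=*)                   ssl_cname="${kv#*=}" ;;
      sslCertPemFile=*)             cert="${kv#*=}" ;;
      sslKeyPemFile=*)              key="${kv#*=}" ;;
    esac
  done
  # Nothing to check unless inference is enabled.
  [ "$enable_inference" = "True" ] || return 0
  if [ -z "$service_type" ]; then
    echo "inferenceRouterServiceType is required when enableInference=True"; status=1
  fi
  if [ "$router_ha" = "True" ] && [ "$nodes" -lt 3 ]; then
    echo "fewer than 3 nodes: set inferenceRouterHA=False"; status=1
  fi
  # HTTPS is expected unless insecure connections are explicitly allowed.
  if [ "$insecure" != "True" ]; then
    [ -n "$ssl_cname" ] || { echo "sslCname is required for HTTPS"; status=1; }
    if [ -z "$ssl_secret" ] && { [ -z "$cert" ] || [ -z "$key" ]; }; then
      echo "set sslSecret, or both sslCertPemFile and sslKeyPemFile"; status=1
    fi
  fi
  return "$status"
}
```

For example, validate_inference_config 2 enableInference=True inferenceRouterServiceType=NodePort reports the missing HTTPS settings and the high-availability mismatch for a two-node cluster.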

Azure Machine Learning extension deployment - CLI examples and Azure portal

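To deploy the Azure Machine Learning extension with the CLI, use the az k8s-extension create command and pass in values for the mandatory parameters.

We list four typical extension deployment scenarios for reference. To deploy the extension for your production usage, carefully read the complete list of configuration settings.

Use an AKS cluster in Azure for a quick proof of concept that runs all kinds of ML workloads, that is, running training jobs or deploying models as online/batch endpoints

For an Azure Machine Learning extension deployment on an AKS cluster, make sure to specify managedClusters as the value of the --cluster-type parameter. Run the following Azure CLI command to deploy the Azure Machine Learning extension: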
```
az k8s-extension create --name <extension-name> --extension-type Microsoft.AzureML.Kubernetes --config enableTraining=True enableInference=True inferenceRouterServiceType=LoadBalancer allowInsecureConnections=True inferenceRouterHA=False --cluster-type managedClusters --cluster-name <your-AKS-cluster-name> --resource-group <your-RG-name> --scope cluster
```

Use an Arc Kubernetes cluster outside Azure for a quick proof of concept, running training jobs only

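For an Azure Machine Learning extension deployment on an Arc Kubernetes cluster, you must specify connectedClusters as the value of the --cluster-type parameter. Run the following Azure CLI command to deploy the Azure Machine Learning extension: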
```
az k8s-extension create --name <extension-name> --extension-type Microsoft.AzureML.Kubernetes --config enableTraining=True --cluster-type connectedClusters --cluster-name <your-connected-cluster-name> --resource-group <your-RG-name> --scope cluster
```

Enable an AKS cluster in Azure for production training and inference workloads

For an Azure Machine Learning extension deployment on AKS, make sure to specify managedClusters as the value of the --cluster-type parameter. This example assumes that your cluster has three or more nodes and that you use an Azure public load balancer and HTTPS for inference workload support. Run the following Azure CLI command to deploy the Azure Machine Learning extension:

```
az k8s-extension create --name <extension-name> --extension-type Microsoft.AzureML.Kubernetes --config enableTraining=True enableInference=True inferenceRouterServiceType=LoadBalancer sslCname=<ssl cname> --config-protected sslCertPemFile=<file-path-to-cert-PEM> sslKeyPemFile=<file-path-to-cert-KEY> --cluster-type managedClusters --cluster-name <your-AKS-cluster-name> --resource-group <your-RG-name> --scope cluster
```

Enable an Arc Kubernetes cluster anywhere for production training and inference workloads using NVIDIA GPUs

For an Azure Machine Learning extension deployment on an Arc Kubernetes cluster, make sure to specify connectedClusters as the value of the --cluster-type parameter. This example assumes that your cluster has three or more nodes and that you use the NodePort service type and HTTPS for inference workload support. Run the following Azure CLI command to deploy the Azure Machine Learning extension:

```
az k8s-extension create --name <extension-name> --extension-type Microsoft.AzureML.Kubernetes --config enableTraining=True enableInference=True inferenceRouterServiceType=NodePort sslCname=<ssl cname> installNvidiaDevicePlugin=True installDcgmExporter=True --config-protected sslCertPemFile=<file-path-to-cert-PEM> sslKeyPemFile=<file-path-to-cert-KEY> --cluster-type connectedClusters --cluster-name <your-connected-cluster-name> --resource-group <your-RG-name> --scope cluster
```

Verify Azure Machine Learning extension deployment

Run the following CLI command to check the Azure Machine Learning extension details:

```
az k8s-extension show --name <extension-name> --cluster-type connectedClusters --cluster-name <your-connected-cluster-name> --resource-group <resource-group>
```

In the response, look for "name" and "provisioningState": "Succeeded". Note that it might show "provisioningState": "Pending" for the first few minutes.
If provisioningState shows Succeeded, run the following command on your machine, with the kubeconfig file pointed to your cluster, to check that all pods in the "azureml" namespace are in the Running state:
```
kubectl get pods -n azureml
```
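
Because the provisioning state can stay at Pending for a few minutes, the check can be wrapped in a polling loop. The following is a minimal sketch; get_state and wait_for_extension are hypothetical helpers, the EXTENSION_NAME, CLUSTER_TYPE, CLUSTER_NAME, and RESOURCE_GROUP variables are assumed to be set by you, and treating Failed as a terminal state is an assumption rather than something this article states.

```shell
# get_state: prints the current provisioningState of the extension.
# Assumes EXTENSION_NAME, CLUSTER_TYPE, CLUSTER_NAME, and RESOURCE_GROUP are set.
get_state() {
  az k8s-extension show \
    --name "$EXTENSION_NAME" \
    --cluster-type "$CLUSTER_TYPE" \
    --cluster-name "$CLUSTER_NAME" \
    --resource-group "$RESOURCE_GROUP" \
    --query provisioningState --output tsv
}

# wait_for_extension ATTEMPTS DELAY_SECONDS
# Polls until the state is Succeeded (returns 0), or until the state is Failed
# or the attempts run out (returns 1).
wait_for_extension() {
  local attempts="$1" delay="$2" i=0 state
  while [ "$i" -lt "$attempts" ]; do
    state="$(get_state)"
    echo "provisioningState: $state"
    case "$state" in
      Succeeded) return 0 ;;
      Failed)    return 1 ;;  # assumed terminal failure state
    esac
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}
```

For example, wait_for_extension 30 20 && kubectl get pods -n azureml polls for up to about ten minutes and lists the pods only after the deployment reports Succeeded.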

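Review Azure Machine Learning extension components

When the Azure Machine Learning extension deployment completes, you can use kubectl get deployments -n azureml to see the list of resources created in the cluster. It usually consists of a subset of the following resources, depending on the configuration settings specified.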

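Resource name	Resource type	Training	Inference	Training and inference	Description	Communication with cloud
relayserver	Kubernetes deployment	✓	✓	✓	The relay server is created only for Arc Kubernetes clusters, not for AKS clusters. It works with Azure Relay to communicate with the cloud services.	Receives requests for job creation and model deployment from the cloud service; syncs job status with the cloud service.
gateway	Kubernetes deployment	✓	✓	✓	The gateway is used to communicate and send data back and forth.	Sends node and cluster resource information to the cloud services.
aml-operator	Kubernetes deployment	✓	N/A	✓	Manages the lifecycle of training jobs.	Exchanges tokens with the cloud token service for authentication and authorization of Azure Container Registry.
metrics-controller-manager	Kubernetes deployment	✓	✓	✓	Manages the configuration for Prometheus.	N/A
{EXTENSION-NAME}-kube-state-metrics	Kubernetes deployment	✓	✓	✓	Exports cluster-related metrics to Prometheus.	N/A
{EXTENSION-NAME}-prometheus-operator	Kubernetes deployment	Optional	Optional	Optional	Provides Kubernetes-native deployment and management of Prometheus and related monitoring components.	N/A
amlarc-identity-controller	Kubernetes deployment	N/A	✓	✓	Requests and renews Azure Blob/Azure Container Registry tokens through managed identity.	Exchanges tokens with the cloud token service for authentication and authorization of the Azure Container Registry and Azure Blob used by inference/model deployment.
amlarc-identity-proxy	Kubernetes deployment	N/A	✓	✓	Requests and renews Azure Blob/Azure Container Registry tokens through managed identity.	Exchanges tokens with the cloud token service for authentication and authorization of the Azure Container Registry and Azure Blob used by inference/model deployment.
azureml-fe-v2	Kubernetes deployment	N/A	✓	✓	The front-end component that routes incoming inference requests to deployed services.	Sends service logs to Azure Blob.
inference-operator-controller-manager	Kubernetes deployment	N/A	✓	✓	Manages the lifecycle of inference endpoints.	N/A
volcano-admission	Kubernetes deployment	Optional	N/A	Optional	Volcano admission webhook.	N/A
volcano-controllers	Kubernetes deployment	Optional	N/A	Optional	Manages the lifecycle of Azure Machine Learning training job pods.	N/A
volcano-scheduler	Kubernetes deployment	Optional	N/A	Optional	Performs in-cluster job scheduling.	N/A
fluent-bit	Kubernetes daemonset	✓	✓	✓	Gathers the components' system logs.	Uploads the components' system logs to the cloud.
{EXTENSION-NAME}-dcgm-exporter	Kubernetes daemonset	Optional	Optional	Optional	Exposes GPU metrics for Prometheus.	N/A
nvidia-device-plugin-daemonset	Kubernetes daemonset	Optional	Optional	Optional	Exposes GPUs on each node of the cluster.	N/A
prometheus-prom-prometheus	Kubernetes statefulset	✓	✓	✓	Gathers job metrics and sends them to the cloud.	Sends job metrics such as CPU, GPU, and memory utilization to the cloud.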

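Important

The Azure Relay resource is in the same resource group as the Arc cluster resource. It's used to communicate with the Kubernetes cluster, and modifying it breaks attached compute targets.
By default, the Kubernetes deployment resources are randomly deployed to one or more nodes of the cluster, and DaemonSet resources are deployed to all nodes. To restrict the extension deployment to specific nodes, use the nodeSelector configuration setting described in the configuration settings table.

Note

{EXTENSION-NAME} is the extension name specified with the az k8s-extension create --name CLI command.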
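
To illustrate how the {EXTENSION-NAME} prefix expands, the following sketch prints the prefixed resource names from the components table for a given extension name. prefixed_resource_names is a hypothetical helper introduced here. Two of the three resources are optional, so they appear in the cluster only when the corresponding settings are enabled.

```shell
# prefixed_resource_names EXTENSION_NAME
# Prints the resource names from the components table that carry the
# {EXTENSION-NAME} prefix. prometheus-operator and dcgm-exporter are optional.
prefixed_resource_names() {
  local suffix
  for suffix in kube-state-metrics prometheus-operator dcgm-exporter; do
    printf '%s-%s\n' "$1" "$suffix"
  done
}
```

For example, prefixed_resource_names my-ext prints my-ext-kube-state-metrics, my-ext-prometheus-operator, and my-ext-dcgm-exporter. Because the table also contains DaemonSet and StatefulSet resources, kubectl get deployments,daemonsets,statefulsets -n azureml lists all three kinds in one call.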

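Manage Azure Machine Learning extension

Update, list, show, and delete an Azure Machine Learning extension.

For an AKS cluster without Azure Arc connected, see Deploy and manage cluster extensions.
For Azure Arc-enabled Kubernetes, see Deploy and manage Azure Arc-enabled Kubernetes cluster extensions.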
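
The four management operations map onto az k8s-extension subcommands. The following is a sketch that assembles them; run and ext are hypothetical helpers introduced here, and the command is only printed unless DRY_RUN=0 is set, so a delete can't run by accident. See the linked articles for the full parameter lists.

```shell
# run CMD...: prints the command by default; set DRY_RUN=0 to execute it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# ext ACTION CLUSTER_TYPE CLUSTER_NAME RESOURCE_GROUP [EXTRA_ARGS...]
# ACTION is one of list, show, update, or delete.
# CLUSTER_TYPE is managedClusters for AKS or connectedClusters for Arc.
ext() {
  local action="$1" cluster_type="$2" cluster_name="$3" resource_group="$4"
  shift 4
  run az k8s-extension "$action" \
    --cluster-type "$cluster_type" \
    --cluster-name "$cluster_name" \
    --resource-group "$resource_group" "$@"
}

# Example usage (names are placeholders):
#   ext list   connectedClusters my-cluster my-rg
#   ext show   connectedClusters my-cluster my-rg --name my-ext
#   ext update connectedClusters my-cluster my-rg --name my-ext --config inferenceRouterHA=False
#   ext delete connectedClusters my-cluster my-rg --name my-ext
```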
