Create an Azure Machine Learning compute cluster with CLI v1

2025-06-30

Applies to: Azure CLI ml extension v1, Python SDK azureml v1

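Important

This article provides information about using the Azure Machine Learning SDK v1. SDK v1 is deprecated as of March 31, 2025, and support for it ends on June 30, 2026. You can install and use SDK v1 until that date.

We recommend that you transition to SDK v2 before June 30, 2026. For more information about SDK v2, see What is Azure Machine Learning CLI and Python SDK v2? and the SDK v2 reference.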

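Learn how to create and manage a compute cluster in your Azure Machine Learning workspace.

You can use an Azure Machine Learning compute cluster to distribute a training or batch inference process across a cluster of CPU or GPU compute nodes in the cloud. For more information about VM sizes that include GPUs, see GPU-optimized virtual machine sizes.

In this article, you learn how to:

Create a compute cluster
Lower your compute cluster cost
Set up a managed identity for the cluster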

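Prerequisites

An Azure Machine Learning workspace. For more information, see Create an Azure Machine Learning workspace.
The Azure CLI extension for Machine Learning service (v1), the Azure Machine Learning Python SDK, or the Azure Machine Learning Visual Studio Code extension.

Important

The Azure CLI commands in this article use the azure-cli-ml, or v1, extension for Azure Machine Learning. Support for the v1 extension ends on September 30, 2025. You can install and use the v1 extension until that date.

We recommend that you transition to the ml, or v2, extension before September 30, 2025. For more information about the v2 extension, see Azure Machine Learning CLI extension and Python SDK v2.

If you use the Python SDK, set up your development environment with a workspace. After your environment is set up, attach to the workspace in your Python script:

Applies to: Azure Machine Learning SDK v1 for Python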
```
from azureml.core import Workspace

ws = Workspace.from_config() 
```

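What is a compute cluster?

An Azure Machine Learning compute cluster is a managed compute infrastructure that lets you easily create single-node or multi-node compute. The compute cluster is a resource that can be shared with other users in your workspace. The compute scales up automatically when a job is submitted, and it can be placed in an Azure virtual network. A compute cluster in a virtual network can also be deployed with no public IP address.

Compute clusters can run jobs securely in a virtual network environment without requiring enterprises to open SSH ports. The job runs in a containerized environment, with your model dependencies packaged in a Docker container.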

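Limitations

Compute clusters can be created in a different region and virtual network than your workspace. However, this capability is available only with SDK v2, CLI v2, or studio. For more information, see the v2 version of Secure training environments.
Cluster creation, but not cluster updates, is currently supported through ARM templates. To update a compute, we recommend that you use the SDK, the Azure CLI, or the UX for now.
Azure Machine Learning Compute has default limits, such as the number of cores that can be allocated. For more information, see Manage and request quotas for Azure resources.
Azure lets you place locks on resources so that they can't be deleted or are read-only. Don't apply resource locks to the resource group that contains your workspace. Applying a lock to that resource group prevents scaling operations for Azure Machine Learning compute clusters. For more information about locking resources, see Lock resources to prevent unexpected changes.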

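Tip

Clusters can generally scale up to 100 nodes, as long as you have enough quota for the required number of cores. By default, clusters are set up with inter-node communication enabled between cluster nodes, for example to support MPI jobs. However, you can scale a cluster to thousands of nodes by raising a support ticket and requesting that your subscription, workspace, or a specific cluster be allowlisted for disabling inter-node communication.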
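As a rough illustration of the quota constraint in the tip above, the following sketch estimates how many nodes a core quota allows. The function name, the quota figure, and the cores-per-node value are hypothetical; only the default 100-node ceiling comes from the text.

```python
def max_nodes_for_quota(core_quota: int, cores_per_node: int, node_cap: int = 100) -> int:
    """Return the largest node count that fits in the core quota, capped at node_cap."""
    if core_quota < 0 or cores_per_node <= 0:
        raise ValueError("core_quota must be >= 0 and cores_per_node must be > 0")
    return min(node_cap, core_quota // cores_per_node)


# Hypothetical numbers: a 48-core quota and a VM size with 4 cores per node.
print(max_nodes_for_quota(core_quota=48, cores_per_node=4))  # 12
```

A value computed this way is only an upper bound for max_nodes; other workloads that share the same quota reduce what is actually available.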

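Create

Estimated time: approximately 5 minutes.

Azure Machine Learning Compute can be reused across runs. The compute can be shared with other users in the workspace and is retained between runs. It automatically scales nodes up or down based on the number of runs submitted and the max_nodes set on your cluster. The min_nodes setting controls the minimum number of nodes available.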
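The relationship between min_nodes, max_nodes, and the node count can be pictured with a simplified model. This is not the service's actual scheduling logic; it assumes, purely for illustration, that each queued run needs a fixed number of nodes.

```python
def target_node_count(queued_runs: int, nodes_per_run: int,
                      min_nodes: int, max_nodes: int) -> int:
    """Clamp the demanded node count to the [min_nodes, max_nodes] range."""
    if min_nodes > max_nodes:
        raise ValueError("min_nodes must not exceed max_nodes")
    demanded = queued_runs * nodes_per_run
    return max(min_nodes, min(demanded, max_nodes))


# Illustrative values only.
print(target_node_count(queued_runs=0, nodes_per_run=1, min_nodes=0, max_nodes=4))  # 0
print(target_node_count(queued_runs=6, nodes_per_run=1, min_nodes=0, max_nodes=4))  # 4
```

In this model, runs beyond what max_nodes can serve simply wait in the queue until nodes free up.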

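The dedicated cores per region per VM family quota and the total regional quota, which apply to compute cluster creation, are unified and shared with the Azure Machine Learning training compute instance quota.

Important

To avoid charges when no jobs are running, set the minimum number of nodes to 0. This setting lets Azure Machine Learning deallocate nodes when they aren't in use. Any value larger than 0 keeps that number of nodes running, even if they aren't in use.

The compute autoscales down to zero nodes when it isn't in use. Dedicated VMs are created to run your jobs as needed.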
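To see why a nonzero minimum matters, the following back-of-the-envelope sketch computes what idle nodes cost over a period. The hourly price is a made-up placeholder, not an Azure price; check the pricing page for your VM size and region.

```python
def idle_cost(min_nodes: int, idle_hours: float, price_per_node_hour: float) -> float:
    """Cost of keeping min_nodes allocated while no jobs are running."""
    return min_nodes * idle_hours * price_per_node_hour


# Hypothetical price of 0.50 per node-hour; 30 days with no jobs running.
hours = 30 * 24
print(idle_cost(min_nodes=0, idle_hours=hours, price_per_node_hour=0.50))  # 0.0
print(idle_cost(min_nodes=2, idle_hours=hours, price_per_node_hour=0.50))  # 720.0
```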


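To create a persistent Azure Machine Learning Compute resource in Python, specify the vm_size and max_nodes properties. Azure Machine Learning then uses smart defaults for the other properties.

vm_size: The VM family of the nodes created by Azure Machine Learning Compute.
max_nodes: The maximum number of nodes to autoscale up to when you run a job on Azure Machine Learning Compute.

Applies to: Azure Machine Learning SDK v1 for Python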

from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your CPU cluster
cpu_cluster_name = "cpucluster"

# Verify that cluster does not exist already
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    # To use a different region for the compute, add a location='<region>' parameter
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

cpu_cluster.wait_for_completion(show_output=True)

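You can also configure several advanced properties when you create Azure Machine Learning Compute. These properties let you create a persistent cluster of fixed size, or create the cluster within an existing Azure virtual network in your subscription. For details, see the AmlCompute class.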
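As a sketch of those advanced properties, the helper below assembles keyword arguments for a fixed-size cluster (min_nodes equal to max_nodes) placed in an existing virtual network. The parameter names min_nodes, vnet_resourcegroup_name, vnet_name, and subnet_name reflect the AmlCompute.provisioning_configuration signature as I understand it; verify them against the AmlCompute class reference. The helper names and all resource names are placeholders.

```python
def fixed_size_vnet_config(vm_size, nodes, vnet_resource_group, vnet_name, subnet_name):
    """Build keyword arguments for a fixed-size cluster inside an existing VNet."""
    return {
        "vm_size": vm_size,
        "min_nodes": nodes,  # equal min and max keeps the cluster at a fixed size
        "max_nodes": nodes,
        "vnet_resourcegroup_name": vnet_resource_group,
        "vnet_name": vnet_name,
        "subnet_name": subnet_name,
    }


def create_cluster(ws, name, config_kwargs):
    """Create the cluster; requires azureml-core and a Workspace object."""
    from azureml.core.compute import AmlCompute, ComputeTarget  # imported lazily

    compute_config = AmlCompute.provisioning_configuration(**config_kwargs)
    cluster = ComputeTarget.create(ws, name, compute_config)
    cluster.wait_for_completion(show_output=True)
    return cluster


# Placeholder resource names; replace them with your own.
kwargs = fixed_size_vnet_config("STANDARD_D2_V2", 2, "my-vnet-rg", "my-vnet", "default")
print(kwargs["min_nodes"] == kwargs["max_nodes"])  # True
```

Keep in mind that a fixed-size cluster with more than zero nodes is billed even when idle, so this configuration trades cost for instant availability.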

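Warning

When you set the location parameter to a region different from that of your workspace or datastores, you might see increased network latency and data transfer costs. The latency and costs can occur both when you create the cluster and when you run jobs on it.

Applies to: Azure CLI ml extension v1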

az ml computetarget create amlcompute -n cpu --min-nodes 1 --max-nodes 1 -s STANDARD_D3_V2 --location westus2

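Warning

When you use a compute cluster in a region different from that of your workspace or datastores, you might see increased network latency and data transfer costs. The latency and costs can occur both when you create the cluster and when you run jobs on it.

For more information, see the Azure CLI reference for az ml computetarget create amlcompute.

Lower your compute cluster cost

You can also choose to use low-priority VMs to run some or all of your workloads. These VMs don't have guaranteed availability and might be preempted while in use. You need to restart a preempted job.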
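Because a preempted job isn't restarted for you, a submission script can wrap the run in a retry loop. The sketch below is generic: submit_and_wait is a hypothetical callable that you supply, and the status string "Preempted" is a stand-in rather than an SDK value. Map it to however your job reports preemption.

```python
from typing import Callable


def run_with_restarts(submit_and_wait: Callable[[], str], max_restarts: int = 3) -> int:
    """Resubmit a job until it stops being preempted.

    Returns the number of attempts used. Raises RuntimeError if the job is
    still preempted after max_restarts restarts, or ends in another status.
    """
    for attempt in range(1, max_restarts + 2):  # first try plus max_restarts retries
        status = submit_and_wait()
        if status == "Completed":
            return attempt
        if status != "Preempted":
            raise RuntimeError(f"job ended with status {status!r}")
    raise RuntimeError(f"job still preempted after {max_restarts} restarts")


# Simulated job that is preempted twice and then completes.
outcomes = iter(["Preempted", "Preempted", "Completed"])
print(run_with_restarts(lambda: next(outcomes)))  # 3
```

Capping the number of restarts keeps a script from looping indefinitely when low-priority capacity stays unavailable.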


Applies to: Azure Machine Learning SDK v1 for Python

compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                            vm_priority='lowpriority',
                                                            max_nodes=4)

Applies to: Azure CLI ml extension v1

Set the vm-priority:

az ml computetarget create amlcompute --name lowpriocluster --vm-size Standard_NC6 --max-nodes 5 --vm-priority lowpriority

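Set up managed identity

Azure Machine Learning compute clusters also support managed identities to authenticate access to Azure resources without including credentials in your code. There are two types of managed identities:

A system-assigned managed identity is enabled directly on the Azure Machine Learning compute cluster and compute instance. The life cycle of a system-assigned identity is tied directly to the compute cluster or instance. If the compute cluster or instance is deleted, Azure automatically cleans up the credentials and the identity in Microsoft Entra ID.
A user-assigned managed identity is a standalone Azure resource provided through the Azure Managed Identity service. You can assign a user-assigned managed identity to multiple resources, and it persists for as long as you want. You must create this managed identity beforehand and then pass its resource ID as the required identity_id parameter.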
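The identity_id value is the full Azure resource ID of the user-assigned identity. A small helper can assemble it from its parts, following the path format used in the examples in this article. The helper name and sample values are illustrative.

```python
def user_assigned_identity_id(subscription_id: str, resource_group: str,
                              identity_name: str) -> str:
    """Build the resource ID of a user-assigned managed identity."""
    for label, value in (("subscription_id", subscription_id),
                         ("resource_group", resource_group),
                         ("identity_name", identity_name)):
        if not value or "/" in value:
            raise ValueError(f"{label} must be a non-empty string without '/'")
    return (f"/subscriptions/{subscription_id}"
            f"/resourcegroups/{resource_group}"
            f"/providers/Microsoft.ManagedIdentity/userAssignedIdentities/{identity_name}")


# Placeholder values; identity_id expects a list of such IDs.
identity_ids = [user_assigned_identity_id("0000-1111", "my-rg", "my-identity")]
print(identity_ids[0])
```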


Applies to: Azure Machine Learning SDK v1 for Python

Configure managed identity in your provisioning configuration:

System-assigned managed identity, created in a workspace named ws

# configure cluster with a system-assigned managed identity
compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                        max_nodes=5,
                                                        identity_type="SystemAssigned",
                                                        )
cpu_cluster_name = "cpu-cluster"
cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

User-assigned managed identity, created in a workspace named ws

# configure cluster with a user-assigned managed identity
compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                        max_nodes=5,
                                                        identity_type="UserAssigned",
                                                        identity_id=['/subscriptions/<subscription_id>/resourcegroups/<resource_group>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<user_assigned_identity>'])

cpu_cluster_name = "cpu-cluster"
cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

Add managed identity to an existing compute cluster named cpu_cluster

System-assigned managed identity:

# add a system-assigned managed identity
cpu_cluster.add_identity(identity_type="SystemAssigned")

User-assigned managed identity:

# add a user-assigned managed identity
cpu_cluster.add_identity(identity_type="UserAssigned", 
                            identity_id=['/subscriptions/<subscription_id>/resourcegroups/<resource_group>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<user_assigned_identity>'])

Applies to: Azure CLI ml extension v1

Create a new managed compute cluster with managed identity

User-assigned managed identity

az ml computetarget create amlcompute --name cpu-cluster --vm-size Standard_NC6 --max-nodes 5 --assign-identity '/subscriptions/<subscription_id>/resourcegroups/<resource_group>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<user_assigned_identity>'

System-assigned managed identity

az ml computetarget create amlcompute --name cpu-cluster --vm-size Standard_NC6 --max-nodes 5 --assign-identity '[system]'

Add a managed identity to an existing cluster:

User-assigned managed identity

az ml computetarget amlcompute identity assign --name cpu-cluster '/subscriptions/<subscription_id>/resourcegroups/<resource_group>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<user_assigned_identity>'

System-assigned managed identity

az ml computetarget amlcompute identity assign --name cpu-cluster '[system]'

Note

Azure Machine Learning compute clusters support either one system-assigned identity or multiple user-assigned identities, but not both at the same time.
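That constraint can be checked before calling the SDK or CLI. The following validator is a sketch of the rule as stated in the note; the function is not part of any Azure library.

```python
def validate_identity_config(identity_type, identity_ids=None):
    """Check an identity configuration against the one-kind-only rule."""
    identity_ids = list(identity_ids or [])
    if identity_type == "SystemAssigned":
        if identity_ids:
            raise ValueError("a system-assigned identity can't be combined with user-assigned IDs")
    elif identity_type == "UserAssigned":
        if not identity_ids:
            raise ValueError("UserAssigned requires at least one identity ID")
    else:
        raise ValueError(f"unsupported identity_type: {identity_type!r}")
    return {"identity_type": identity_type, "identity_id": identity_ids or None}


print(validate_identity_config("SystemAssigned"))
# {'identity_type': 'SystemAssigned', 'identity_id': None}
```

The returned dictionary uses the same keyword names as the provisioning examples in this article, so it could be merged into those arguments after validation.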

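Managed identity usage

The default managed identity is the system-assigned managed identity or the first user-assigned managed identity.

During a run, an identity is applied in two ways:

The system uses an identity to set up the user's storage mounts, container registry, and datastores.
- In this case, the system uses the default managed identity.
The user applies an identity to access resources from within the code for a submitted run.
- In this case, provide the client_id that corresponds to the managed identity you want to use to retrieve a credential.
- Alternatively, get the client ID of the user-assigned identity through the DEFAULT_IDENTITY_CLIENT_ID environment variable.

For example, to retrieve a token for a datastore with the default managed identity: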
```
import os
from azure.identity import ManagedIdentityCredential

client_id = os.environ.get('DEFAULT_IDENTITY_CLIENT_ID')
credential = ManagedIdentityCredential(client_id=client_id)
token = credential.get_token('https://storage.azure.com/')
```
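When code might run both on the cluster and elsewhere, it can help to resolve the client ID in one place. The sketch below prefers an explicitly supplied ID and falls back to DEFAULT_IDENTITY_CLIENT_ID. The helper names are hypothetical, and get_storage_token needs the azure-identity package and only works where a managed identity is available.

```python
import os
from typing import Optional


def resolve_client_id(explicit_client_id: Optional[str] = None) -> Optional[str]:
    """Prefer an explicit client ID, else fall back to the environment variable."""
    return explicit_client_id or os.environ.get("DEFAULT_IDENTITY_CLIENT_ID")


def get_storage_token(explicit_client_id: Optional[str] = None):
    """Request a storage token; runs only where a managed identity exists."""
    from azure.identity import ManagedIdentityCredential  # imported lazily

    credential = ManagedIdentityCredential(client_id=resolve_client_id(explicit_client_id))
    return credential.get_token("https://storage.azure.com/")


# Demo value set here only so the example runs outside a cluster.
os.environ["DEFAULT_IDENTITY_CLIENT_ID"] = "11111111-2222-3333-4444-555555555555"
print(resolve_client_id())            # 11111111-2222-3333-4444-555555555555
print(resolve_client_id("explicit"))  # explicit
```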

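Troubleshooting

Some users who created their Azure Machine Learning workspace from the Azure portal before the GA release might be unable to create AmlCompute in that workspace. You can either raise a support request against the service, or create a new workspace through the portal or the SDK to unblock yourself immediately.

Stuck at resizing

If your Azure Machine Learning compute cluster appears stuck at resizing (0 -> 0) for the node state, Azure resource locks might be the cause.

Azure lets you place locks on resources so that they can't be deleted or are read-only. Locking a resource can lead to unexpected results. Some operations that don't seem to modify a resource actually require actions that the lock blocks.

With Azure Machine Learning, applying a delete lock to the resource group for your workspace prevents scaling operations for Azure Machine Learning compute clusters. To work around this problem, we recommend that you remove the lock from the resource group and instead apply it to individual items in the group.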

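Important

Don't apply the lock to the following resources:

Resource name	Resource type
`<GUID>-azurebatch-cloudservicenetworksecurityggroup`	Network security group
`<GUID>-azurebatch-cloudservicepublicip`	Public IP address
`<GUID>-azurebatch-cloudserviceloadbalancer`	Load balancer

These resources are used to communicate with the compute cluster and to perform operations such as scaling. Removing the resource lock from these resources should allow your compute clusters to autoscale.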
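If you have a list of locked resource names (for example, exported from the portal), a quick filter can flag the ones in the table above. The suffixes are copied from the table; the function name and sample data are hypothetical.

```python
# Suffixes copied from the table of resources that must stay unlocked.
PROTECTED_SUFFIXES = (
    "-azurebatch-cloudservicenetworksecurityggroup",
    "-azurebatch-cloudservicepublicip",
    "-azurebatch-cloudserviceloadbalancer",
)


def locks_blocking_scaling(locked_resource_names):
    """Return the locked resources whose names match a protected suffix."""
    return [name for name in locked_resource_names
            if name.lower().endswith(PROTECTED_SUFFIXES)]


# Made-up names for illustration.
locked = [
    "mystorageaccount",
    "0a1b2c3d-azurebatch-cloudservicepublicip",
    "0a1b2c3d-azurebatch-cloudserviceloadbalancer",
]
print(locks_blocking_scaling(locked))
# ['0a1b2c3d-azurebatch-cloudservicepublicip', '0a1b2c3d-azurebatch-cloudserviceloadbalancer']
```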

For more information about resource locking, see Lock resources to prevent unexpected changes.

Next steps

Use your compute cluster to:
