Create an Azure Machine Learning compute cluster with CLI v1
APPLIES TO: Azure CLI ml extension v1, Python SDK azureml v1
Learn how to create and manage a compute cluster in your Azure Machine Learning workspace.
You can use an Azure Machine Learning compute cluster to distribute a training or batch inference process across a cluster of CPU or GPU compute nodes in the cloud. For more information on the VM sizes that include GPUs, see GPU-optimized virtual machine sizes.
In this article, learn how to:
- Create a compute cluster
- Lower your compute cluster cost
- Set up a managed identity for the cluster
Prerequisites
An Azure Machine Learning workspace. For more information, see Create an Azure Machine Learning workspace.
The Azure CLI extension for Machine Learning service (v1), Azure Machine Learning Python SDK, or the Azure Machine Learning Visual Studio Code extension.
Important
Some of the Azure CLI commands in this article use the azure-cli-ml, or v1, extension for Azure Machine Learning. Support for the v1 extension will end on September 30, 2025. You will be able to install and use the v1 extension until that date.
We recommend that you transition to the ml, or v2, extension before September 30, 2025. For more information on the v2 extension, see Azure ML CLI extension and Python SDK v2.
If using the Python SDK, set up your development environment with a workspace. Once your environment is set up, attach to the workspace in your Python script:
APPLIES TO: Python SDK azureml v1
from azureml.core import Workspace
# Load the workspace described by your local config.json file
ws = Workspace.from_config()
What is a compute cluster?
An Azure Machine Learning compute cluster is a managed-compute infrastructure that lets you create single-node or multi-node compute. The compute cluster is a resource that can be shared with other users in your workspace. The compute scales up automatically when a job is submitted, and can be put in an Azure Virtual Network. When deployed in a virtual network, a compute cluster also supports a configuration with no public IP.
Compute clusters can run jobs securely in a virtual network environment, without requiring enterprises to open up SSH ports. The job executes in a containerized environment and packages your model dependencies in a Docker container.
Limitations
Compute clusters can be created in a different region and VNet than your workspace. However, this functionality is only available using the SDK v2, CLI v2, or studio. For more information, see the v2 version of secure training environments.
Clusters can currently be created, but not updated, through ARM templates. To update compute, we recommend using the SDK, Azure CLI, or UX for now.
Azure Machine Learning Compute has default limits, such as the number of cores that can be allocated. For more information, see Manage and request quotas for Azure resources.
Azure allows you to place locks on resources, so that they cannot be deleted or are read only. Do not apply resource locks to the resource group that contains your workspace. Applying a lock to the resource group that contains your workspace will prevent scaling operations for Azure Machine Learning compute clusters. For more information on locking resources, see Lock resources to prevent unexpected changes.
Tip
Clusters can generally scale up to 100 nodes as long as you have enough quota for the number of cores required. By default, clusters are set up with inter-node communication enabled between the nodes of the cluster, for example to support MPI jobs. However, you can scale your clusters to thousands of nodes by raising a support ticket and requesting that your subscription, workspace, or a specific cluster be added to the allow list for disabling inter-node communication.
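As a rough illustration of the quota point above, the cores a cluster requests at full scale are its maximum node count multiplied by the cores per node. The sketch below is plain arithmetic, not an Azure API call. The helper names cores_needed and fits_quota, the 2-core node size, and the 24-core quota are all assumptions made for the example; check your actual quota in the portal.

```python
def cores_needed(max_nodes: int, cores_per_node: int) -> int:
    """Return the cores a cluster uses when scaled out to max_nodes."""
    if max_nodes < 0 or cores_per_node <= 0:
        raise ValueError("max_nodes must be >= 0 and cores_per_node must be > 0")
    return max_nodes * cores_per_node


def fits_quota(max_nodes: int, cores_per_node: int,
               quota_cores: int, cores_in_use: int = 0) -> bool:
    """Check whether a fully scaled cluster stays within the available cores."""
    return cores_in_use + cores_needed(max_nodes, cores_per_node) <= quota_cores


# Hypothetical numbers: a 2-core VM size checked against an assumed 24-core quota.
print(cores_needed(100, 2))                # cores for a 100-node cluster
print(fits_quota(100, 2, quota_cores=24))  # too large for the assumed quota
print(fits_quota(4, 2, quota_cores=24))    # a 4-node cluster fits
```

Running a check like this before choosing max_nodes can save a failed scale-out later.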
Create
Time estimate: Approximately 5 minutes.
Azure Machine Learning Compute can be reused across runs. The compute can be shared with other users in the workspace and is retained between runs, automatically scaling nodes up or down based on the number of runs submitted, and the max_nodes set on your cluster. The min_nodes setting controls the minimum nodes available.
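One way to picture the scaling behavior described above is as a clamp between min_nodes and max_nodes. The function below is a simplified mental model, not the service's actual scheduler. It assumes each queued run needs a fixed number of nodes, and the name target_node_count is hypothetical.

```python
def target_node_count(queued_runs: int, min_nodes: int, max_nodes: int,
                      nodes_per_run: int = 1) -> int:
    """Estimate how many nodes the cluster settles on for a given workload.

    Simplified model: demand is queued_runs * nodes_per_run, clamped to the
    range [min_nodes, max_nodes].
    """
    if min_nodes < 0 or max_nodes < min_nodes:
        raise ValueError("require 0 <= min_nodes <= max_nodes")
    if queued_runs < 0 or nodes_per_run <= 0:
        raise ValueError("queued_runs must be >= 0 and nodes_per_run must be > 0")
    demand = queued_runs * nodes_per_run
    return max(min_nodes, min(demand, max_nodes))


# With min_nodes=0 and max_nodes=4, an idle cluster drops to zero nodes,
# and ten queued single-node runs are capped at four nodes.
print(target_node_count(0, min_nodes=0, max_nodes=4))
print(target_node_count(10, min_nodes=0, max_nodes=4))
```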
The dedicated cores per region per VM family quota and the total regional quota, which apply to compute cluster creation, are unified and shared with the Azure Machine Learning training compute instance quota.
Important
To avoid charges when no jobs are running, set the minimum nodes to 0. This setting allows Azure Machine Learning to de-allocate the nodes when they aren't in use. Any value larger than 0 will keep that number of nodes running, even if they are not in use.
The compute autoscales down to zero nodes when it isn't used. Dedicated VMs are created to run your jobs as needed.
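To make the cost effect of the minimum node setting concrete, the sketch below estimates what idle nodes cost over a period. The hourly rate of 0.25 is a made-up figure for illustration only, and idle_cost is a hypothetical helper; look up the real price for your VM size and region.

```python
def idle_cost(min_nodes: int, hourly_rate: float, idle_hours: float) -> float:
    """Estimate the charge for nodes kept running while no jobs are active."""
    if min_nodes < 0 or hourly_rate < 0 or idle_hours < 0:
        raise ValueError("all inputs must be non-negative")
    return min_nodes * hourly_rate * idle_hours


# Assumed rate of 0.25 per node-hour over a 720-hour (30-day) idle period.
for nodes in (0, 1, 2):
    print(nodes, idle_cost(nodes, hourly_rate=0.25, idle_hours=720))
```

With min_nodes set to 0 the estimate is zero regardless of the rate, which is the point of the recommendation above.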
To create a persistent Azure Machine Learning Compute resource in Python, specify the vm_size and max_nodes properties. Azure Machine Learning then uses smart defaults for the other properties.
- vm_size: The VM family of the nodes created by Azure Machine Learning Compute.
- max_nodes: The max number of nodes to autoscale up to when you run a job on Azure Machine Learning Compute.
APPLIES TO: Python SDK azureml v1
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
# Choose a name for your CPU cluster
cpu_cluster_name = "cpucluster"
# Verify that cluster does not exist already
try:
    cpu_cluster = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    # To use a different region for the compute, add a location='<region>' parameter
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                           max_nodes=4)
    cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)
cpu_cluster.wait_for_completion(show_output=True)
You can also configure several advanced properties when you create Azure Machine Learning Compute. The properties allow you to create a persistent cluster of fixed size, or within an existing Azure Virtual Network in your subscription. See the AmlCompute class for details.
Warning
When setting the location parameter, if it is a different region than your workspace or datastores, you may see increased network latency and data transfer costs. The latency and costs can occur when creating the cluster, and when running jobs on it.
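The advanced properties mentioned above (a fixed-size cluster, or a cluster inside an existing virtual network) can be sketched as keyword arguments for AmlCompute.provisioning_configuration. The helper names build_cluster_settings and create_cluster are hypothetical, and the sketch assumes the three virtual network values are supplied together. Verify the parameter names against the AmlCompute class reference for your SDK version.

```python
from typing import Optional


def build_cluster_settings(vm_size: str, max_nodes: int, min_nodes: int = 0,
                           idle_seconds_before_scaledown: Optional[int] = None,
                           vnet_resourcegroup_name: Optional[str] = None,
                           vnet_name: Optional[str] = None,
                           subnet_name: Optional[str] = None) -> dict:
    """Collect keyword arguments for AmlCompute.provisioning_configuration."""
    if min_nodes < 0 or max_nodes < min_nodes:
        raise ValueError("require 0 <= min_nodes <= max_nodes")
    settings = {"vm_size": vm_size, "min_nodes": min_nodes, "max_nodes": max_nodes}
    if idle_seconds_before_scaledown is not None:
        settings["idle_seconds_before_scaledown"] = idle_seconds_before_scaledown
    vnet_parts = (vnet_resourcegroup_name, vnet_name, subnet_name)
    if any(vnet_parts):
        # Assumption in this sketch: all three network values go together.
        if not all(vnet_parts):
            raise ValueError("provide the resource group, vnet, and subnet together")
        settings.update(vnet_resourcegroup_name=vnet_resourcegroup_name,
                        vnet_name=vnet_name, subnet_name=subnet_name)
    return settings


def create_cluster(ws, name: str, settings: dict):
    """Create the cluster; needs the azureml-core package and a workspace."""
    # Imported here so build_cluster_settings works without the SDK installed.
    from azureml.core.compute import AmlCompute, ComputeTarget
    config = AmlCompute.provisioning_configuration(**settings)
    cluster = ComputeTarget.create(ws, name, config)
    cluster.wait_for_completion(show_output=True)
    return cluster


# A fixed-size cluster keeps min_nodes equal to max_nodes.
fixed_size = build_cluster_settings("STANDARD_D2_V2", max_nodes=4, min_nodes=4)
print(fixed_size)
```

Keep in mind that a fixed-size cluster never scales to zero, so it is billed even when idle.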
Lower your compute cluster cost
You may also choose to use low-priority VMs to run some or all of your workloads. These VMs do not have guaranteed availability and may be preempted while in use. You will have to restart a preempted job.
APPLIES TO: Python SDK azureml v1
compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                       vm_priority='lowpriority',
                                                       max_nodes=4)
Set up managed identity
Azure Machine Learning compute clusters also support managed identities to authenticate access to Azure resources without including credentials in your code. There are two types of managed identities:
- A system-assigned managed identity is enabled directly on the Azure Machine Learning compute cluster and compute instance. The life cycle of a system-assigned identity is directly tied to the compute cluster or instance. If the compute cluster or instance is deleted, Azure automatically cleans up the credentials and the identity in Microsoft Entra ID.
- A user-assigned managed identity is a standalone Azure resource provided through Azure Managed Identity service. You can assign a user-assigned managed identity to multiple resources, and it persists for as long as you want. This managed identity needs to be created beforehand and then passed as the required identity_id parameter.
APPLIES TO: Python SDK azureml v1
Configure managed identity in your provisioning configuration:
System-assigned managed identity created in a workspace named ws:
# configure cluster with a system-assigned managed identity
compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                       max_nodes=5,
                                                       identity_type="SystemAssigned")
cpu_cluster_name = "cpu-cluster"
cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)
User-assigned managed identity created in a workspace named ws:
# configure cluster with a user-assigned managed identity
compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2',
                                                       max_nodes=5,
                                                       identity_type="UserAssigned",
                                                       identity_id=['/subscriptions/<subscription_id>/resourcegroups/<resource_group>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<user_assigned_identity>'])
cpu_cluster_name = "cpu-cluster"
cpu_cluster = ComputeTarget.create(ws, cpu_cluster_name, compute_config)
Add managed identity to an existing compute cluster named cpu_cluster:
System-assigned managed identity:
# add a system-assigned managed identity
cpu_cluster.add_identity(identity_type="SystemAssigned")
User-assigned managed identity:
# add a user-assigned managed identity
cpu_cluster.add_identity(identity_type="UserAssigned",
                         identity_id=['/subscriptions/<subscription_id>/resourcegroups/<resource_group>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<user_assigned_identity>'])
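The identity_id values shown above all follow one resource ID pattern, so it can help to build them from their parts rather than typing the path by hand. The helper below is a hypothetical convenience, not part of the SDK, and the subscription, resource group, and identity names in the example are placeholders.

```python
def user_assigned_identity_id(subscription_id: str, resource_group: str,
                              identity_name: str) -> str:
    """Build the resource ID string for a user-assigned managed identity."""
    parts = {"subscription_id": subscription_id,
             "resource_group": resource_group,
             "identity_name": identity_name}
    for label, value in parts.items():
        # Reject empty values and stray slashes that would corrupt the path.
        if not value or "/" in value:
            raise ValueError(f"invalid {label}: {value!r}")
    return (f"/subscriptions/{subscription_id}"
            f"/resourcegroups/{resource_group}"
            "/providers/Microsoft.ManagedIdentity"
            f"/userAssignedIdentities/{identity_name}")


# Placeholder values for illustration only.
identity_ids = [user_assigned_identity_id(
    "00000000-0000-0000-0000-000000000000", "my-resource-group", "my-identity")]
print(identity_ids[0])
```

The resulting list can be passed as the identity_id argument, which expects a list of resource IDs as in the examples above.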
Note
Azure Machine Learning compute clusters support only one system-assigned identity or multiple user-assigned identities, not both concurrently.
Managed identity usage
The default managed identity is the system-assigned managed identity or the first user-assigned managed identity.
During a run there are two applications of an identity:
- The system uses an identity to set up the user's storage mounts, container registry, and datastores.
  - In this case, the system will use the default-managed identity.
- The user applies an identity to access resources from within the code for a submitted run.
  - In this case, provide the client_id corresponding to the managed identity you want to use to retrieve a credential.
  - Alternatively, get the user-assigned identity's client ID through the DEFAULT_IDENTITY_CLIENT_ID environment variable.
For example, to retrieve a token for a datastore with the default-managed identity:
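import os
from azure.identity import ManagedIdentityCredential
# Client ID of the cluster's default managed identity
client_id = os.environ.get('DEFAULT_IDENTITY_CLIENT_ID')
credential = ManagedIdentityCredential(client_id=client_id)
token = credential.get_token('https://storage.azure.com/')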
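The two options above (an explicit client_id, or the DEFAULT_IDENTITY_CLIENT_ID variable) can be combined into one lookup. The sketch below is a hypothetical helper, not an SDK function. It prefers an explicit client ID, falls back to the environment variable, and returns None when neither is available, which leaves identity selection to the credential.

```python
import os
from typing import Mapping, Optional


def resolve_client_id(explicit_client_id: Optional[str] = None,
                      environ: Optional[Mapping[str, str]] = None) -> Optional[str]:
    """Pick the managed identity client ID to use inside a submitted run."""
    env = os.environ if environ is None else environ
    if explicit_client_id:
        return explicit_client_id
    # Returns None when the variable is not set.
    return env.get("DEFAULT_IDENTITY_CLIENT_ID") or None


# Inside a run you would pass the result to the credential, for example:
#   credential = ManagedIdentityCredential(client_id=resolve_client_id())
# The dictionary below stands in for the run's environment in this example.
fake_env = {"DEFAULT_IDENTITY_CLIENT_ID": "11111111-1111-1111-1111-111111111111"}
print(resolve_client_id(environ=fake_env))
print(resolve_client_id("22222222-2222-2222-2222-222222222222", environ=fake_env))
```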
Troubleshooting
There is a chance that some users who created their Azure Machine Learning workspace from the Azure portal before the GA release might not be able to create AmlCompute in that workspace. You can either raise a support request against the service or create a new workspace through the portal or the SDK to unblock yourself immediately.
Stuck at resizing
If your Azure Machine Learning compute cluster appears stuck at resizing (0 -> 0) for the node state, this may be caused by Azure resource locks.
Azure allows you to place locks on resources, so that they cannot be deleted or are read only. Locking a resource can lead to unexpected results. Some operations that don't seem to modify the resource actually require actions that are blocked by the lock.
With Azure Machine Learning, applying a delete lock to the resource group for your workspace will prevent scaling operations for Azure Machine Learning compute clusters. To work around this problem, we recommend removing the lock from the resource group and instead applying it to individual items in the group.
Important
Do not apply the lock to the following resources:
Resource name | Resource type |
---|---|
<GUID>-azurebatch-cloudservicenetworksecurityggroup | Network security group |
<GUID>-azurebatch-cloudservicepublicip | Public IP address |
<GUID>-azurebatch-cloudserviceloadbalancer | Load balancer |
These resources are used to communicate with, and perform operations such as scaling on, the compute cluster. Removing the resource lock from these resources should allow autoscaling for your compute clusters.
For more information on resource locking, see Lock resources to prevent unexpected changes.
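When moving a lock from the resource group to individual resources, a small filter can help you skip the resources listed above that must stay unlocked. The sketch below works on plain resource names and makes no Azure calls. It assumes those resources can be recognized by the shared "-azurebatch-cloudservice" portion of their names, and both helper names and all example resource names are hypothetical.

```python
from typing import Iterable, List

# Name fragment shared by the three resources that must not be locked (assumption).
PROTECTED_FRAGMENT = "-azurebatch-cloudservice"


def must_stay_unlocked(resource_name: str) -> bool:
    """Return True for resources the compute cluster needs to scale."""
    return PROTECTED_FRAGMENT in resource_name.lower()


def lockable_resources(resource_names: Iterable[str]) -> List[str]:
    """Keep only the resources that are safe to lock individually."""
    return [name for name in resource_names if not must_stay_unlocked(name)]


# Made-up resource names for illustration.
names = [
    "mystorageaccount",
    "abc123-azurebatch-cloudservicepublicip",
    "abc123-azurebatch-cloudserviceloadbalancer",
    "mykeyvault",
]
print(lockable_resources(names))
```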
Next steps
Use your compute cluster to: