Secure an Azure Machine Learning training environment with virtual networks

APPLIES TO: Python SDK azure-ai-ml v2 (preview)

In this article, you learn how to secure training environments with a virtual network in Azure Machine Learning, by using the Azure Machine Learning studio and the Python SDK v2.

Tip

This article is part of a series on securing an Azure Machine Learning workflow. See the other articles in this series:

For a tutorial on creating a secure workspace, see Tutorial: Create a secure workspace or Tutorial: Create a secure workspace using a template.

In this article, you learn how to secure the following training compute resources in a virtual network:

  • Azure Machine Learning compute cluster
  • Azure Machine Learning compute instance
  • Azure Databricks
  • Virtual Machine
  • HDInsight cluster

Prerequisites

  • Read the Network security overview article to understand common virtual network scenarios and overall virtual network architecture.

  • An existing virtual network and subnet to use with your compute resources.

  • To deploy resources into a virtual network or subnet, your user account must have permissions to the following actions in Azure role-based access control (Azure RBAC):

    • "Microsoft.Network/virtualNetworks/*/read" on the virtual network resource. This permission isn't needed for Azure Resource Manager (ARM) template deployments.
    • "Microsoft.Network/virtualNetworks/subnet/join/action" on the subnet resource.

    For more information on Azure RBAC with networking, see the Networking built-in roles article.

Azure Machine Learning compute cluster/instance

  • Compute clusters and instances create the following resources. If they're unable to create these resources (for example, if there's a resource lock on the resource group), then creation, scale out, or scale in may fail.

    • IP address.
    • Network Security Group (NSG).
    • Load balancer.
  • The virtual network must be in the same subscription as the Azure Machine Learning workspace.

  • The subnet used for the compute instance or cluster must have enough unassigned IP addresses.

    • A compute cluster can dynamically scale. If there aren't enough unassigned IP addresses, the cluster will be partially allocated.
    • A compute instance only requires one IP address.
  • To create a compute cluster or instance without a public IP address (a preview feature), your workspace must use a private endpoint to connect to the VNet. For more information, see Configure a private endpoint for Azure Machine Learning workspace.

  • If you plan to secure the virtual network by restricting traffic, see the Required public internet access section.

  • The subnet used to deploy a compute cluster or compute instance shouldn't be delegated to any other service, such as Azure Container Instances (ACI).

Azure Databricks

  • The virtual network must be in the same subscription and region as the Azure Machine Learning workspace.
  • If the Azure Storage Account(s) for the workspace are also secured in a virtual network, they must be in the same virtual network as the Azure Databricks cluster.

Limitations

Azure Machine Learning compute cluster/instance

  • If you put multiple compute instances or clusters in one virtual network, you may need to request a quota increase for one or more of your resources. The Machine Learning compute instance or cluster automatically allocates networking resources in the resource group that contains the virtual network. For each compute instance or cluster, the service allocates the following resources:

    • One network security group (NSG). This NSG contains the following rules, which are specific to compute cluster and compute instance:

      Important

      Compute instance and compute cluster automatically create an NSG with the required rules.

      If you have another NSG at the subnet level, the rules in the subnet level NSG mustn't conflict with the rules in the automatically created NSG.

      To learn how the NSGs filter your network traffic, see How network security groups filter network traffic.

      • Allow inbound TCP traffic on ports 29876-29877 from the BatchNodeManagement service tag.
      • Allow inbound TCP traffic on port 44224 from the AzureMachineLearning service tag.

      The following screenshot shows an example of these rules:

      Screenshot of NSG

      Tip

      If your compute cluster or instance does not use a public IP address (a preview feature), these inbound NSG rules are not required.

    • For a compute cluster or instance, it's now possible to remove the public IP address (a preview feature). If you have Azure Policy assignments prohibiting public IP creation, deployment of a compute cluster or instance without a public IP will succeed.

    • One load balancer

    For compute clusters, these resources are deleted every time the cluster scales down to 0 nodes and re-created when it scales up.

    For a compute instance, these resources are kept until the instance is deleted. Stopping the instance doesn't remove the resources.

    Important

    These resources are limited by the subscription's resource quotas. If the resource group that contains the virtual network is locked, deletion of the compute cluster or instance will fail. The load balancer can't be deleted until the compute cluster or instance is deleted. Also make sure there's no Azure Policy assignment that prohibits creation of network security groups.

  • If you create a compute instance and plan to use the no public IP address configuration, your Azure Machine Learning workspace's managed identity must be assigned the Reader role for the virtual network that contains the workspace. For more information on assigning roles, see Steps to assign an Azure role.

  • If you have configured Azure Container Registry for your workspace behind the virtual network, you must use a compute cluster to build Docker images. If you use a compute cluster configured for no public IP address, you must provide some method for the cluster to access the public internet. Internet access is required to pull images from the Microsoft Container Registry and to install packages from PyPI, Conda, and similar repositories. For more information, see Enable Azure Container Registry.

  • If the Azure Storage Accounts for the workspace are also in the virtual network, use the following guidance on subnet limitations:

    • If you plan to use Azure Machine Learning studio to visualize data or use designer, the storage account must be in the same subnet as the compute instance or cluster.
    • If you plan to use the SDK, the storage account can be in a different subnet.

    Note

    Adding a resource instance for your workspace or selecting the checkbox for "Allow trusted Microsoft services to access this account" is not sufficient to allow communication from the compute.

  • When your workspace uses a private endpoint, the compute instance can only be accessed from inside the virtual network. If you use a custom DNS or hosts file, add an entry for <instance-name>.<region>.instances.azureml.ms. Map this entry to the private IP address of the workspace private endpoint. For more information, see the custom DNS article.

  • Virtual network service endpoint policies don't work for compute cluster/instance system storage accounts.

  • If storage and compute instance are in different regions, you may see intermittent timeouts.

  • If the Azure Container Registry for your workspace uses a private endpoint to connect to the virtual network, you can’t use a managed identity for the compute instance. To use a managed identity with the compute instance, don't put the container registry in the VNet.

  • If you want to use Jupyter Notebooks on a compute instance:

    • Don't disable websocket communication. Make sure your network allows websocket communication to *.instances.azureml.net and *.instances.azureml.ms.
    • Make sure that your notebook is running on a compute resource behind the same virtual network and subnet as your data. When creating the compute instance, use Advanced settings > Configure virtual network to select the network and subnet.
  • Compute clusters can be created in a different region than your workspace. This functionality is in preview, and is only available for compute clusters, not compute instances. When using a different region for the cluster, the following limitations apply:

    • If your workspace associated resources, such as storage, are in a different virtual network than the cluster, set up global virtual network peering between the networks. For more information, see Virtual network peering.
    • You may see increased network latency and data transfer costs. The latency and costs can occur when creating the cluster, and when running jobs on it.

    Guidance such as NSG rules, user-defined routes, and inbound/outbound requirements applies as normal when using a different region than the workspace.

    Warning

    If you are using a private endpoint-enabled workspace, creating the cluster in a different region is not supported.

  • Azure Machine Learning compute clusters and compute instances require outbound access on the public internet to Azure Storage (*.blob.core.windows.net) in the Azure region of the workspace. Both are based on Azure Batch, and need to access a storage account provided by Azure Batch on the public network.

    You can mitigate this data exfiltration risk by using a Service Endpoint Policy. This feature is currently in preview. For more information, see the Azure Machine Learning data exfiltration prevention article.

Azure Databricks

  • In addition to the databricks-private and databricks-public subnets used by Azure Databricks, the default subnet created for the virtual network is also required.
  • Azure Databricks doesn't use a private endpoint to communicate with the virtual network.

For more information on using Azure Databricks in a virtual network, see Deploy Azure Databricks in your Azure Virtual Network.

Azure HDInsight or virtual machine

  • Azure Machine Learning supports only virtual machines that are running Ubuntu.

Required public internet access

Azure Machine Learning requires both inbound and outbound access to the public internet. The following tables provide an overview of what access is required and what it is for. The protocol for all items is TCP. For service tags that end in .region, replace region with the Azure region that contains your workspace. For example, Storage.westus:

| Direction | Ports | Service tag | Purpose |
| --- | --- | --- | --- |
| Inbound | 29876-29877 | BatchNodeManagement | Create, update, and delete of Azure Machine Learning compute instance and compute cluster. It isn't required if you use the No public IP option. |
| Inbound | 44224 | AzureMachineLearning | Create, update, and delete of Azure Machine Learning compute instance. It isn't required if you use the No public IP option. |
| Outbound | 80, 443 | AzureActiveDirectory | Authentication using Azure AD. |
| Outbound | 443, 8787, 18881 | AzureMachineLearning | Using Azure Machine Learning services. |
| Outbound | 443 | AzureResourceManager | Creation of Azure resources with Azure Machine Learning. |
| Outbound | 443, 445 (*) | Storage.region | Access data stored in the Azure Storage Account for compute cluster and compute instance. This outbound can be used to exfiltrate data. For more information, see Data exfiltration protection. |
| Outbound | 443 | AzureFrontDoor.FrontEnd (not needed in Azure China) | Global entry point for Azure Machine Learning studio. Store images and environments for AutoML. |
| Outbound | 443 | MicrosoftContainerRegistry.region (this tag has a dependency on the AzureFrontDoor.FirstParty tag) | Access Docker images provided by Microsoft. Setup of the Azure Machine Learning router for Azure Kubernetes Service. |
| Outbound | 443 | AzureMonitor | Used to log monitoring and metrics to App Insights and Azure Monitor. |
| Outbound | 443 | Keyvault.region | Access the key vault for the Azure Batch service. Only needed if your workspace was created with the hbi_workspace flag enabled. |

(*) Port 445 is only required if you have a firewall between your virtual network for Azure ML and a private endpoint for your storage accounts.

Tip

If you need the IP addresses instead of service tags, use one of the following options:

The IP addresses may change periodically.
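
If you manage your own NSG at the subnet level and need to reproduce the two inbound allowances from the table above, they can be created programmatically. The following is a minimal sketch using the azure-mgmt-network package (a separate management SDK, not azure-ai-ml); the subscription ID, resource group, NSG name, and rule priorities are placeholders, and the NSG that Azure Machine Learning creates automatically already contains equivalent rules:

```python
# Minimal sketch: add the two inbound service-tag rules from the table above
# to an existing subnet-level NSG. Uses azure-mgmt-network; the subscription,
# resource group, NSG name, and priorities are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

network_client = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

inbound_rules = [
    ("AllowBatchNodeManagementInbound", "BatchNodeManagement", "29876-29877", 1040),
    ("AllowAzureMachineLearningInbound", "AzureMachineLearning", "44224", 1050),
]

for rule_name, service_tag, ports, priority in inbound_rules:
    network_client.security_rules.begin_create_or_update(
        "<resource-group>",
        "<nsg-name>",
        rule_name,
        {
            "protocol": "Tcp",
            "direction": "Inbound",
            "access": "Allow",
            "priority": priority,
            "source_address_prefix": service_tag,  # service tag used as the source
            "source_port_range": "*",
            "destination_address_prefix": "*",
            "destination_port_range": ports,
        },
    ).result()
```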

Important

When using a compute cluster that is configured for no public IP address, you must allow the following traffic:

  • Inbound traffic from a source of VirtualNetwork on any source port, to a destination of VirtualNetwork on destination ports 29876-29877.
  • Inbound traffic from a source of AzureLoadBalancer on any source port, to a destination of VirtualNetwork on destination port 44224.

You may also need to allow outbound traffic to Visual Studio Code and non-Microsoft sites for the installation of packages required by your machine learning project. The following table lists commonly used repositories for machine learning:

| Host name | Purpose |
| --- | --- |
| anaconda.com, *.anaconda.com | Used to install default packages. |
| *.anaconda.org | Used to get repo data. |
| pypi.org | Used to list dependencies from the default index, if any, and if the index isn't overwritten by user settings. If the index is overwritten, you must also allow *.pythonhosted.org. |
| cloud.r-project.org | Used when installing CRAN packages for R development. |
| *pytorch.org | Used by some examples based on PyTorch. |
| *.tensorflow.org | Used by some examples based on TensorFlow. |
| code.visualstudio.com | Required to download and install VS Code desktop. Not required for VS Code Web. |
| update.code.visualstudio.com, *.vo.msecnd.net | Used to retrieve VS Code server bits that are installed on the compute instance through a setup script. |
| marketplace.visualstudio.com, vscode.blob.core.windows.net, *.gallerycdn.vsassets.io | Required to download and install VS Code extensions. These enable the remote connection to compute instances provided by the Azure ML extension for VS Code. For more information, see Connect to an Azure Machine Learning compute instance in Visual Studio Code. |
| raw.githubusercontent.com/microsoft/vscode-tools-for-ai/master/azureml_remote_websocket_server/* | Used to retrieve websocket server bits that are installed on the compute instance. The websocket server transmits requests from the Visual Studio Code client (desktop application) to the Visual Studio Code server running on the compute instance. |

When using Azure Kubernetes Service (AKS) with Azure Machine Learning, allow the following traffic to the AKS VNet:

For information on using a firewall solution, see Use a firewall with Azure Machine Learning.

Compute clusters

Use the following steps to create a compute cluster in the Azure Machine Learning studio:

  1. Sign in to Azure Machine Learning studio, and then select your subscription and workspace.

  2. Select Compute on the left, Compute clusters from the center, and then select + New.

    Screenshot of creating a cluster

  3. In the Create compute cluster dialog, select the VM size and configuration you need and then select Next.

    Screenshot of setting VM config

  4. From the Configure Settings section, set the Compute name, Virtual network, and Subnet.

    Screenshot shows setting compute name, virtual network, and subnet.

    Tip

    If your workspace uses a private endpoint to connect to the virtual network, the Virtual network selection field is greyed out.

  5. Select Create to create the compute cluster.

When the creation process finishes, you can train your model by using the cluster in an experiment.
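
You can create the same kind of cluster with the Python SDK v2 by passing the virtual network and subnet as network settings. The following is a minimal sketch; the subscription, resource group, workspace, VM size, and network names are placeholder values to replace with your own:

```python
# Minimal sketch: create a compute cluster in an existing VNet/subnet with the
# Python SDK v2 (azure-ai-ml). All names and the VM size are placeholders.
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AmlCompute, NetworkSettings
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

cluster = AmlCompute(
    name="cpu-cluster",
    size="STANDARD_D2_V2",
    min_instances=0,
    max_instances=4,
    # The subnet must be in the same subscription as the workspace and have
    # enough unassigned IP addresses for the maximum node count.
    network_settings=NetworkSettings(vnet_name="<vnet-name>", subnet="<subnet-name>"),
)

ml_client.begin_create_or_update(cluster).result()
```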

Note

You may choose to use low-priority VMs to run some or all of your workloads. See how to create a low-priority VM.

No public IP for compute clusters (preview)

When you enable No public IP, your compute cluster doesn't use a public IP for communication with any dependencies. Instead, it communicates solely within the virtual network by using the Azure Private Link ecosystem and service or private endpoints, eliminating the need for a public IP entirely. No public IP removes access and discoverability of compute cluster nodes from the internet, eliminating a significant threat vector. It also helps you comply with the no-public-IP policies many enterprises have.

A compute cluster with No public IP enabled has no inbound communication requirements from the public internet. Specifically, neither inbound NSG rule (BatchNodeManagement, AzureMachineLearning) is required. You still need to allow inbound traffic from a source of VirtualNetwork on any source port to a destination of VirtualNetwork on destination ports 29876-29877, and inbound traffic from a source of AzureLoadBalancer on any source port to a destination of VirtualNetwork on destination port 44224.

Warning

By default, you don't have public internet access from a No Public IP compute cluster. This prevents outbound access to required resources such as Azure Active Directory, Azure Resource Manager, Microsoft Container Registry, and the other outbound resources listed in the Required public internet access section, as well as to non-Microsoft resources such as PyPI or Conda repositories. To resolve this problem, configure user-defined routing (UDR) to reach a public IP address that can access the internet. For example, you can use the public IP of your firewall, or you can use Virtual Network NAT with a public IP.

No public IP clusters depend on Azure Private Link for the Azure Machine Learning workspace. A compute cluster with No public IP also requires you to disable private endpoint network policies and private link service network policies. These requirements come from Azure Private Link service and private endpoints, and aren't specific to Azure Machine Learning. Follow the instructions in Disable network policies for Private Link service to set the disable-private-endpoint-network-policies and disable-private-link-service-network-policies parameters on the virtual network subnet.

For outbound connections to work, you need to set up an egress firewall, such as Azure Firewall, with user-defined routes. For instance, you can use a firewall set up with an inbound/outbound configuration and route traffic there by defining a route table on the subnet in which the compute cluster is deployed. The route table entry can set the next hop to the private IP address of the firewall, with an address prefix of 0.0.0.0/0.
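
As an illustration of that route table, the following minimal sketch uses the azure-mgmt-network package (not azure-ai-ml); the resource names, region, and firewall IP address are placeholders, and you still need to associate the route table with the compute subnet afterward:

```python
# Minimal sketch: route all outbound traffic (0.0.0.0/0) from the compute
# subnet to the private IP of an egress firewall. Uses azure-mgmt-network;
# names, region, and the IP address are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

network_client = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

network_client.route_tables.begin_create_or_update(
    "<resource-group>",
    "amlcompute-egress-rt",
    {
        "location": "<region>",
        "routes": [
            {
                "name": "default-to-firewall",
                "address_prefix": "0.0.0.0/0",
                "next_hop_type": "VirtualAppliance",
                "next_hop_ip_address": "<firewall-private-ip>",
            }
        ],
    },
).result()
# Associate the route table with the subnet that hosts the compute cluster,
# for example through the portal or the subnets API.
```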

You can use a service endpoint or private endpoint for your Azure Container Registry and Azure Storage in the subnet in which the cluster is deployed.

To create a no public IP address compute cluster (a preview feature) in studio, select the No public IP checkbox in the virtual network section. You can also create a no public IP compute cluster through an ARM template by setting the enableNodePublicIP parameter to false.
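
If you use the Python SDK v2, the equivalent setting is the enable_node_public_ip flag on AmlCompute, assuming your azure-ai-ml version exposes it. A minimal sketch with placeholder names:

```python
# Minimal sketch: compute cluster with no public IP (preview) using the Python
# SDK v2. Assumes azure-ai-ml exposes enable_node_public_ip on AmlCompute;
# all resource names are placeholders.
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AmlCompute, NetworkSettings
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(), "<subscription-id>", "<resource-group>", "<workspace-name>"
)

cluster = AmlCompute(
    name="cpu-cluster-no-pip",
    size="STANDARD_D2_V2",
    min_instances=0,
    max_instances=4,
    network_settings=NetworkSettings(vnet_name="<vnet-name>", subnet="<subnet-name>"),
    enable_node_public_ip=False,  # same effect as enableNodePublicIP=false in an ARM template
)

ml_client.begin_create_or_update(cluster).result()
```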

Note

Support for compute instances without public IP addresses is currently available and in public preview for the following regions: France Central, East Asia, West Central US, South Central US, West US 2, East US, East US 2, North Europe, West Europe, Central US, North Central US, West US, Australia East, Japan East, Japan West.

Support for compute clusters without public IP addresses is currently available and in public preview for the following regions: France Central, East Asia, West Central US, South Central US, West US 2, East US, North Europe, East US 2, Central US, West Europe, North Central US, West US, Australia East, Japan East, Japan West.

Troubleshooting

  • If you get the error message The specified subnet has PrivateLinkServiceNetworkPolicies or PrivateEndpointNetworkEndpoints enabled during cluster creation, follow the instructions in Disable network policies for Private Link service and Disable network policies for Private Endpoint.

  • If job execution fails with connection issues to ACR or Azure Storage, verify that you've added service endpoints or private endpoints for ACR and Azure Storage to the subnet, and that ACR and Azure Storage allow access from the subnet.

  • To confirm that you've created a no public IP cluster, view the cluster details in studio; the No Public IP property is set to true under resource properties.
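
You can also check the property from the Python SDK v2; a minimal sketch, assuming the compute object returned by azure-ai-ml exposes enable_node_public_ip and using placeholder names:

```python
# Minimal sketch: confirm the no public IP setting from the SDK. Names are
# placeholders; assumes the returned compute object exposes enable_node_public_ip.
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(), "<subscription-id>", "<resource-group>", "<workspace-name>"
)

cluster = ml_client.compute.get("cpu-cluster-no-pip")
print(cluster.enable_node_public_ip)  # expect False for a no public IP cluster
```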

Compute instance

For steps on how to create a compute instance deployed in a virtual network, see Create and manage an Azure Machine Learning compute instance.
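
If you use the Python SDK v2 rather than the studio, a compute instance can be placed in the virtual network by passing the same kind of network settings as for a cluster. A minimal sketch with placeholder names and VM size:

```python
# Minimal sketch: create a compute instance in an existing VNet/subnet with
# the Python SDK v2 (azure-ai-ml). All names and the VM size are placeholders.
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ComputeInstance, NetworkSettings
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

instance = ComputeInstance(
    name="my-vnet-ci",
    size="STANDARD_DS3_V2",
    # A compute instance only needs one unassigned IP address in the subnet.
    network_settings=NetworkSettings(vnet_name="<vnet-name>", subnet="<subnet-name>"),
)

ml_client.begin_create_or_update(instance).result()
```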

No public IP for compute instances (preview)

When you enable No public IP, your compute instance doesn't use a public IP for communication with any dependencies. Instead, it communicates solely within the virtual network by using the Azure Private Link ecosystem and service or private endpoints, eliminating the need for a public IP entirely. No public IP removes access and discoverability of the compute instance node from the internet, eliminating a significant threat vector. Compute instances also do packet filtering to reject any traffic from outside the virtual network. No public IP instances depend on Azure Private Link for the Azure Machine Learning workspace.

Warning

By default, you don't have public internet access from a No Public IP compute instance. You need to configure user-defined routing (UDR) to reach a public IP address that can access the internet. For example, you can use the public IP of your firewall, or you can use Virtual Network NAT with a public IP. Specifically, you need access to Azure Active Directory, Azure Resource Manager, Microsoft Container Registry, and the other outbound resources listed in the Required public internet access section. You may also need outbound access to non-Microsoft resources such as PyPI or Conda repositories.

For outbound connections to work, you need to set up an egress firewall, such as Azure Firewall, with user-defined routes. For instance, you can use a firewall set up with an inbound/outbound configuration and route traffic there by defining a route table on the subnet in which the compute instance is deployed. The route table entry can set the next hop to the private IP address of the firewall, with an address prefix of 0.0.0.0/0.

A compute instance with No public IP enabled has no inbound communication requirements from the public internet. Specifically, neither inbound NSG rule (BatchNodeManagement, AzureMachineLearning) is required. You still need to allow inbound traffic from a source of VirtualNetwork on any source port to a destination of VirtualNetwork on destination ports 29876, 29877, and 44224.

A compute instance with No public IP also requires you to disable private endpoint network policies and private link service network policies. These requirements come from Azure Private Link service and private endpoints, and aren't specific to Azure Machine Learning. Follow the instructions in Disable network policies for Private Link service to set the disable-private-endpoint-network-policies and disable-private-link-service-network-policies parameters on the virtual network subnet.

To create a no public IP address compute instance (a preview feature) in studio, select the No public IP checkbox in the virtual network section. You can also create a no public IP compute instance through an ARM template by setting the enableNodePublicIP parameter to false.
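
With the Python SDK v2, the equivalent setting is the enable_node_public_ip flag on ComputeInstance, assuming your azure-ai-ml version exposes it. A minimal sketch with placeholder names:

```python
# Minimal sketch: compute instance with no public IP (preview) using the
# Python SDK v2. Assumes azure-ai-ml exposes enable_node_public_ip on
# ComputeInstance; all resource names are placeholders.
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ComputeInstance, NetworkSettings
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(), "<subscription-id>", "<resource-group>", "<workspace-name>"
)

instance = ComputeInstance(
    name="my-no-pip-ci",
    size="STANDARD_DS3_V2",
    network_settings=NetworkSettings(vnet_name="<vnet-name>", subnet="<subnet-name>"),
    enable_node_public_ip=False,  # same effect as enableNodePublicIP=false in an ARM template
)

ml_client.begin_create_or_update(instance).result()
```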

Note

Support for compute instances without public IP addresses is currently available and in public preview for the following regions: France Central, East Asia, West Central US, South Central US, West US 2, East US, East US 2, North Europe, West Europe, Central US, North Central US, West US, Australia East, Japan East, Japan West.

Support for compute clusters without public IP addresses is currently available and in public preview for the following regions: France Central, East Asia, West Central US, South Central US, West US 2, East US, North Europe, East US 2, Central US, West Europe, North Central US, West US, Australia East, Japan East, Japan West.

Inbound traffic

When using an Azure Machine Learning compute instance (with a public IP) or compute cluster, allow inbound traffic from Azure Batch management and Azure Machine Learning services. A compute instance with no public IP (preview) doesn't require this inbound communication. A network security group allowing this traffic is dynamically created for you. However, you may also need to create user-defined routes (UDR) if you have a firewall. When creating a UDR for this traffic, you can use either IP addresses or service tags to route the traffic.

Important

Using service tags with user-defined routes is now GA. For more information, see Virtual Network routing.

Tip

While a compute instance without a public IP (a preview feature) does not need a UDR for this inbound traffic, you will still need these UDRs if you also use a compute cluster or a compute instance with a public IP.

For the Azure Machine Learning service, you must add the IP addresses of both the primary and secondary regions. To find the secondary region, see Cross-region replication in Azure. For example, if your Azure Machine Learning service is in East US 2, the secondary region is Central US.

To get a list of IP addresses of the Batch service and Azure Machine Learning service, download the Azure IP Ranges and Service Tags and search the file for BatchNodeManagement.<region> and AzureMachineLearning.<region>, where <region> is your Azure region.

Important

The IP addresses may change over time.

When creating the UDR, set the Next hop type to Internet. This means the inbound communication from Azure skips your firewall to access the load balancers with the public IPs of the compute instance and compute cluster. The UDR is required because the compute instance and compute cluster get random public IPs at creation, so you can't know the public IPs before creation to register them on your firewall and allow inbound traffic from Azure to those specific IPs. The following image shows an example IP address based UDR in the Azure portal:

Image of a user-defined route configuration

For information on configuring UDR, see Route network traffic with a routing table.
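
As a programmatic alternative to the portal configuration shown above, the following minimal sketch uses the azure-mgmt-network package to add one such route; the names are placeholders, and you repeat the call for each IP prefix you find for BatchNodeManagement.<region> and AzureMachineLearning.<region> in the downloaded service tags file:

```python
# Minimal sketch: add a "next hop Internet" route for one IP prefix found for
# BatchNodeManagement.<region> or AzureMachineLearning.<region>. Uses
# azure-mgmt-network; names and the prefix are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

network_client = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

network_client.routes.begin_create_or_update(
    "<resource-group>",
    "<route-table-name>",       # route table associated with the compute subnet
    "batch-node-management-1",  # one route per address prefix
    {
        "address_prefix": "<ip-prefix-from-service-tags-file>",
        "next_hop_type": "Internet",
    },
).result()
```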

For more information on inbound and outbound traffic requirements for Azure Machine Learning, see Use a workspace behind a firewall.

Azure Databricks

For specific information on using Azure Databricks with a virtual network, see Deploy Azure Databricks in your Azure Virtual Network.

Virtual machine or HDInsight cluster

In this section, you learn how to use a virtual machine or Azure HDInsight cluster in a virtual network with your workspace.

Create the VM or HDInsight cluster

Create a VM or HDInsight cluster by using the Azure portal or the Azure CLI, and put the cluster in an Azure virtual network. For more information, see the following articles:

Configure network ports

To allow Azure Machine Learning to communicate with the SSH port on the VM or cluster, configure a source entry for the network security group. The SSH port is usually port 22. To allow traffic from this source, do the following actions:

  1. In the Source drop-down list, select Service Tag.

  2. In the Source service tag drop-down list, select AzureMachineLearning.

    Inbound rules for doing experimentation on a VM or HDInsight cluster within a virtual network

  3. In the Source port ranges drop-down list, select *.

  4. In the Destination drop-down list, select Any.

  5. In the Destination port ranges drop-down list, select 22.

  6. Under Protocol, select Any.

  7. Under Action, select Allow.

Keep the default outbound rules for the network security group. For more information, see the default security rules in Security groups.

If you don't want to use the default outbound rules and you do want to limit the outbound access of your virtual network, see the required public internet access section.

Attach the VM or HDInsight cluster

Attach the VM or HDInsight cluster to your Azure Machine Learning workspace. For more information, see Manage compute resources for model training and deployment in studio.
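
If you prefer the Python SDK v2 over the studio, the attach step looks roughly like the following sketch, assuming your azure-ai-ml version provides the VirtualMachineCompute and VirtualMachineSshSettings entities; the resource ID, credentials, and names are placeholders:

```python
# Minimal sketch: attach an existing VM in the virtual network as a compute
# target. Assumes azure-ai-ml provides VirtualMachineCompute and
# VirtualMachineSshSettings; IDs, names, and paths are placeholders.
from azure.ai.ml import MLClient
from azure.ai.ml.entities import VirtualMachineCompute, VirtualMachineSshSettings
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(), "<subscription-id>", "<resource-group>", "<workspace-name>"
)

vm_compute = VirtualMachineCompute(
    name="my-vnet-vm",
    resource_id=(
        "/subscriptions/<subscription-id>/resourceGroups/<resource-group>"
        "/providers/Microsoft.Compute/virtualMachines/<vm-name>"
    ),
    ssh_settings=VirtualMachineSshSettings(
        admin_username="<admin-username>",
        ssh_port=22,  # the SSH port you allowed in the NSG rule above
        ssh_private_key_file="<path-to-private-key-file>",
    ),
)

ml_client.begin_create_or_update(vm_compute).result()
```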

Next steps

This article is part of a series on securing an Azure Machine Learning workflow. See the other articles in this series: