Secure an Azure Machine Learning training environment with virtual networks (SDKv1)

APPLIES TO: Python SDK azureml v1

In this article, you learn how to secure training environments with a virtual network in Azure Machine Learning using the Python SDK v1.

Azure Machine Learning compute instance and compute cluster can be used to securely train models in a virtual network. When planning your environment, you can configure the compute instance/cluster with or without a public IP address. The general differences between the two are:

  • No public IP: Reduces costs as it doesn't have the same networking resource requirements. Improves security by removing the requirement for inbound traffic from the internet. However, there are additional configuration changes required to enable outbound access to required resources (Azure Active Directory, Azure Resource Manager, etc.).
  • Public IP: Works by default, but costs more due to additional Azure networking resources. Requires inbound communication from the Azure Machine Learning service over the public internet.

The following table contains the differences between these configurations:

Configuration With public IP Without public IP
Inbound traffic AzureMachineLearning None
Outbound traffic By default, can access the public internet with no restrictions.
You can restrict what it accesses using a Network Security Group or firewall.
By default, it cannot access the public internet since there is no public IP resource.
You need a Virtual Network NAT gateway or Firewall to route outbound traffic to required resources on the internet.
Azure networking resources Public IP address, load balancer, network interface None

You can also use Azure Databricks or HDInsight to train models in a virtual network.

Tip

For information on using the Azure Machine Learning studio and the Python SDK v2, see Secure training environment (v2).

For a tutorial on creating a secure workspace, see Tutorial: Create a secure workspace in Azure portal or Tutorial: Create a secure workspace using a template.

In this article you learn how to secure the following training compute resources in a virtual network:

  • Azure Machine Learning compute cluster
  • Azure Machine Learning compute instance
  • Azure Databricks
  • Virtual Machine
  • HDInsight cluster

Prerequisites

  • Read the Network security overview article to understand common virtual network scenarios and overall virtual network architecture.

  • An existing virtual network and subnet to use with your compute resources. This VNet must be in the same subscription as your Azure Machine Learning workspace.

    • We recommend putting the storage accounts used by your workspace and training jobs in the same Azure region that you plan to use for your compute instances and clusters. If they aren't in the same Azure region, you may incur data transfer costs and increased network latency.
    • Make sure that WebSocket communication is allowed to *.instances.azureml.net and *.instances.azureml.ms in your VNet. WebSockets are used by Jupyter on compute instances.
  • An existing subnet in the virtual network. This subnet is used when creating compute instances and clusters.

    • Make sure that the subnet isn't delegated to other Azure services.
    • Make sure that the subnet contains enough free IP addresses. Each compute instance requires one IP address. Each node within a compute cluster requires one IP address.
  • If you have your own DNS server, we recommend using DNS forwarding to resolve the fully qualified domain names (FQDN) of compute instances and clusters. For more information, see Use a custom DNS with Azure Machine Learning.

  • To deploy resources into a virtual network or subnet, your user account must have permissions to the following actions in Azure role-based access control (Azure RBAC):

    • "Microsoft.Network/virtualNetworks/*/read" on the virtual network resource. This permission isn't needed for Azure Resource Manager (ARM) template deployments.
    • "Microsoft.Network/virtualNetworks/subnet/join/action" on the subnet resource.

    For more information on Azure RBAC with networking, see the Networking built-in roles

Limitations

Azure Machine Learning compute cluster/instance

  • Compute clusters can be created in a different region than your workspace. This functionality is in preview, and is only available for compute clusters, not compute instances. When using a different region for the cluster, the following limitations apply:

    • If your workspace associated resources, such as storage, are in a different virtual network than the cluster, set up global virtual network peering between the networks. For more information, see Virtual network peering.
    • You may see increased network latency and data transfer costs. The latency and costs can occur when creating the cluster, and when running jobs on it.

    Guidance such as using NSG rules, user-defined routes, and input/output requirements, apply as normal when using a different region than the workspace.

    Warning

    If you are using a private endpoint-enabled workspace, creating the cluster in a different region is not supported.

  • Compute cluster/instance deployment in virtual network isn't supported with Azure Lighthouse.

Azure Databricks

  • The virtual network must be in the same subscription and region as the Azure Machine Learning workspace.
  • If the Azure Storage Account(s) for the workspace are also secured in a virtual network, they must be in the same virtual network as the Azure Databricks cluster.
  • In addition to the databricks-private and databricks-public subnets used by Azure Databricks, the default subnet created for the virtual network is also required.
  • Azure Databricks doesn't use a private endpoint to communicate with the virtual network.

For more information on using Azure Databricks in a virtual network, see Deploy Azure Databricks in your Azure Virtual Network.

Azure HDInsight or virtual machine

  • Azure Machine Learning supports only virtual machines that are running Ubuntu.

Compute instance/cluster with no public IP

Important

If you have been using compute instances or compute clusters configured for no public IP without opting-in to the preview, you will need to delete and recreate them after January 20 (when the feature is generally available).

If you were previously using the preview of no public IP, you may also need to modify what traffic you allow inbound and outbound, as the requirements have changed for general availability:

  • Outbound requirements - Two additional outbound, which are only used for the management of compute instances and clusters. The destination of these service tags are owned by Microsoft:
    • AzureMachineLearning service tag on UDP port 5831.
    • BatchNodeManagement service tag on TCP port 443.

The following configurations are in addition to those listed in the Prerequisites section, and are specific to creating a compute instances/clusters configured for no public IP:

  • Your workspace must use a private endpoint to connect to the VNet. For more information, see Configure a private endpoint for Azure Machine Learning workspace.

  • In your VNet, allow outbound traffic to the following service tags or fully qualified domain names (FQDN):

    Service tag Protocol Port Notes
    AzureMachineLearning TCP
    UDP
    443/8787/18881
    5831
    Communication with the Azure Machine Learning service.
    BatchNodeManagement.<region> ANY 443 Replace <region> with the Azure region that contains your Azure Machine learning workspace. Communication with Azure Batch. Compute instance and compute cluster are implemented using the Azure Batch service.
    Storage.<region> TCP 443 Replace <region> with the Azure region that contains your Azure Machine learning workspace. This service tag is used to communicate with the Azure Storage account used by Azure Batch.

    Important

    The outbound access to Storage.<region> could potentially be used to exfiltrate data from your workspace. By using a Service Endpoint Policy, you can mitigate this vulnerability. For more information, see the Azure Machine Learning data exfiltration prevention article.

    FQDN Protocol Port Notes
    <region>.tundra.azureml.ms UDP 5831 Replace <region> with the Azure region that contains your Azure Machine learning workspace.
    graph.windows.net TCP 443 Communication with the Microsoft Graph API.
    *.instances.azureml.ms TCP 443/8787/18881 Communication with Azure Machine Learning.
    <region>.batch.azure.com ANY 443 Replace <region> with the Azure region that contains your Azure Machine learning workspace. Communication with Azure Batch.
    <region>.service.batch.com ANY 443 Replace <region> with the Azure region that contains your Azure Machine learning workspace. Communication with Azure Batch.
    *.blob.core.windows.net TCP 443 Communication with Azure Blob storage.
    *.queue.core.windows.net TCP 443 Communication with Azure Queue storage.
    *.table.core.windows.net TCP 443 Communication with Azure Table storage.
  • Create either a firewall and outbound rules or a NAT gateway and network service groups to allow outbound traffic. Since the compute has no public IP address, it can't communicate with resources on the public internet without this configuration. For example, it wouldn't be able to communicate with Azure Active Directory or Azure Resource Manager. Installing Python packages from public sources would also require this configuration.

    For more information on the outbound traffic that is used by Azure Machine Learning, see the following articles:

Use the following information to create a compute instance or cluster with no public IP address:

To create a compute instance or compute cluster with no public IP, use the Azure Machine Learning studio UI to create the resource:

  1. Sign in to the Azure Machine Learning studio, and then select your subscription and workspace.

  2. Select the Compute page from the left navigation bar.

  3. Select the + New from the navigation bar of compute instance or compute cluster.

  4. Configure the VM size and configuration you need, then select Next.

  5. From the Advanced Settings, Select Enable virtual network, your virtual network and subnet, and finally select the No Public IP option under the VNet/subnet section.

    A screenshot of how to configure no public IP for compute instance and compute cluster.

Tip

You can also use the Azure Machine Learning SDK v2 or Azure CLI extension for ML v2. For information on creating a compute instance or cluster with no public IP, see the v2 version of Secure an Azure Machine Learning training environment article.

Compute instance/cluster with public IP

The following configurations are in addition to those listed in the Prerequisites section, and are specific to creating compute instances/clusters that have a public IP:

  • If you put multiple compute instances/clusters in one virtual network, you may need to request a quota increase for one or more of your resources. The Machine Learning compute instance or cluster automatically allocates networking resources in the resource group that contains the virtual network. For each compute instance or cluster, the service allocates the following resources:

    • A network security group (NSG) is automatically created. This NSG allows inbound TCP traffic on port 44224 from the AzureMachineLearning service tag.

      Important

      Compute instance and compute cluster automatically create an NSG with the required rules.

      If you have another NSG at the subnet level, the rules in the subnet level NSG mustn't conflict with the rules in the automatically created NSG.

      To learn how the NSGs filter your network traffic, see How network security groups filter network traffic.

    • One load balancer

    For compute clusters, these resources are deleted every time the cluster scales down to 0 nodes and created when scaling up.

    For a compute instance, these resources are kept until the instance is deleted. Stopping the instance doesn't remove the resources.

    Important

    These resources are limited by the subscription's resource quotas. If the virtual network resource group is locked then deletion of compute cluster/instance will fail. Load balancer cannot be deleted until the compute cluster/instance is deleted. Also please ensure there is no Azure Policy assignment which prohibits creation of network security groups.

  • In your VNet, allow inbound TCP traffic on port 44224 from the AzureMachineLearning service tag.

    Important

    The compute instance/cluster is dynamically assigned an IP address when it is created. Since the address is not known before creation, and inbound access is required as part of the creation process, you cannot statically assign it on your firewall. Instead, if you are using a firewall with the VNet you must create a user-defined route to allow this inbound traffic.

  • In your VNet, allow outbound traffic to the following service tags:

    Service tag Protocol Port Notes
    AzureMachineLearning TCP
    UDP
    443/8787/18881
    5831
    Communication with the Azure Machine Learning service.
    BatchNodeManagement.<region> ANY 443 Replace <region> with the Azure region that contains your Azure Machine learning workspace. Communication with Azure Batch. Compute instance and compute cluster are implemented using the Azure Batch service.
    Storage.<region> TCP 443 Replace <region> with the Azure region that contains your Azure Machine learning workspace. This service tag is used to communicate with the Azure Storage account used by Azure Batch.

    Important

    The outbound access to Storage.<region> could potentially be used to exfiltrate data from your workspace. By using a Service Endpoint Policy, you can mitigate this vulnerability. For more information, see the Azure Machine Learning data exfiltration prevention article.

    FQDN Protocol Port Notes
    <region>.tundra.azureml.ms UDP 5831 Replace <region> with the Azure region that contains your Azure Machine learning workspace.
    graph.windows.net TCP 443 Communication with the Microsoft Graph API.
    *.instances.azureml.ms TCP 443/8787/18881 Communication with Azure Machine Learning.
    <region>.batch.azure.com ANY 443 Replace <region> with the Azure region that contains your Azure Machine learning workspace. Communication with Azure Batch.
    <region>.service.batch.com ANY 443 Replace <region> with the Azure region that contains your Azure Machine learning workspace. Communication with Azure Batch.
    *.blob.core.windows.net TCP 443 Communication with Azure Blob storage.
    *.queue.core.windows.net TCP 443 Communication with Azure Queue storage.
    *.table.core.windows.net TCP 443 Communication with Azure Table storage.

APPLIES TO: Python SDK azureml v1

import datetime
import time

from azureml.core.compute import ComputeTarget, ComputeInstance
from azureml.core.compute_target import ComputeTargetException

# Choose a name for your instance
# Compute instance name should be unique across the azure region
compute_name = "ci{}".format(ws._workspace_id)[:10]

# Verify that instance does not exist already
try:
    instance = ComputeInstance(workspace=ws, name=compute_name)
    print('Found existing instance, use it.')
except ComputeTargetException:
    compute_config = ComputeInstance.provisioning_configuration(
        vm_size='STANDARD_D3_V2',
        ssh_public_access=False,
        vnet_resourcegroup_name='vnet_resourcegroup_name',
        vnet_name='vnet_name',
        subnet_name='subnet_name',
        # admin_user_ssh_public_key='<my-sshkey>'
    )
    instance = ComputeInstance.create(ws, compute_name, compute_config)
    instance.wait_for_completion(show_output=True)

When the creation process finishes, you train your model. For more information, see Select and use a compute target for training.

Note

You may choose to use low-priority VMs to run some or all of your workloads. See how to create a low-priority VM.

Azure Databricks

  • The virtual network must be in the same subscription and region as the Azure Machine Learning workspace.
  • If the Azure Storage Account(s) for the workspace are also secured in a virtual network, they must be in the same virtual network as the Azure Databricks cluster.
  • In addition to the databricks-private and databricks-public subnets used by Azure Databricks, the default subnet created for the virtual network is also required.
  • Azure Databricks doesn't use a private endpoint to communicate with the virtual network.

For specific information on using Azure Databricks with a virtual network, see Deploy Azure Databricks in your Azure Virtual Network.

Required public internet access to train models

Important

While previous sections of this article describe configurations required to create compute resources, the configuration information in this section is required to use these resources to train models.

Azure Machine Learning requires both inbound and outbound access to the public internet. The following tables provide an overview of what access is required and what purpose it serves. For service tags that end in .region, replace region with the Azure region that contains your workspace. For example, Storage.westus:

Tip

The required tab lists the required inbound and outbound configuration. The situational tab lists optional inbound and outbound configurations required by specific configurations you may want to enable.

Direction Protocol &
ports
Service tag Purpose
Outbound TCP: 80, 443 AzureActiveDirectory Authentication using Azure AD.
Outbound TCP: 443, 18881
UDP: 5831
AzureMachineLearning Using Azure Machine Learning services.
Port 18881 is used for Python intellisense in notebooks.
Port 5831 is used to create, update, and delete Azure Machine Learning compute instance.
Outbound ANY: 443 BatchNodeManagement.region Communication with Azure Batch back-end for Azure Machine Learning compute instances/clusters.
Outbound TCP: 443 AzureResourceManager Creation of Azure resources with Azure Machine Learning, Azure CLI, and Azure Machine Learning SDK.
Outbound TCP: 443 Storage.region Access data stored in the Azure Storage Account for compute cluster and compute instance. This outbound can be used to exfiltrate data. For more information, see Data exfiltration protection.
Outbound TCP: 443 AzureFrontDoor.FrontEnd
* Not needed in Azure China.
Global entry point for Azure Machine Learning studio. Store images and environments for AutoML. This outbound can be used to exfiltrate data. For more information, see Data exfiltration protection.
Outbound TCP: 443 MicrosoftContainerRegistry.region
Note that this tag has a dependency on the AzureFrontDoor.FirstParty tag
Access docker images provided by Microsoft. Setup of the Azure Machine Learning router for Azure Kubernetes Service.

Tip

If you need the IP addresses instead of service tags, use one of the following options:

The IP addresses may change periodically.

Important

When using a compute cluster that is configured for no public IP address, you must allow the following traffic:

  • Inbound from source of VirtualNetwork and any port source, to destination of VirtualNetwork, and destination port of 29876, 29877.
  • Inbound from source AzureLoadBalancer and any port source to destination VirtualNetwork and port 44224 destination.

You may also need to allow outbound traffic to Visual Studio Code and non-Microsoft sites for the installation of packages required by your machine learning project. The following table lists commonly used repositories for machine learning:

Host name Purpose
anaconda.com
*.anaconda.com
Used to install default packages.
*.anaconda.org Used to get repo data.
pypi.org Used to list dependencies from the default index, if any, and the index isn't overwritten by user settings. If the index is overwritten, you must also allow *.pythonhosted.org.
cloud.r-project.org Used when installing CRAN packages for R development.
*pytorch.org Used by some examples based on PyTorch.
*.tensorflow.org Used by some examples based on Tensorflow.
code.visualstudio.com Required to download and install VS Code desktop. This is not required for VS Code Web.
update.code.visualstudio.com
*.vo.msecnd.net
Used to retrieve VS Code server bits that are installed on the compute instance through a setup script.
marketplace.visualstudio.com
vscode.blob.core.windows.net
*.gallerycdn.vsassets.io
Required to download and install VS Code extensions. These enable the remote connection to Compute Instances provided by the Azure ML extension for VS Code, see Connect to an Azure Machine Learning compute instance in Visual Studio Code for more information.
raw.githubusercontent.com/microsoft/vscode-tools-for-ai/master/azureml_remote_websocket_server/* Used to retrieve websocket server bits, which are installed on the compute instance. The websocket server is used to transmit requests from Visual Studio Code client (desktop application) to Visual Studio Code server running on the compute instance.

When using Azure Kubernetes Service (AKS) with Azure Machine Learning, allow the following traffic to the AKS VNet:

For information on using a firewall solution, see Use a firewall with Azure Machine Learning.

Next steps

This article is part of a series on securing an Azure Machine Learning workflow. See the other articles in this series: