Secure your managed online endpoints with network isolation

08/28/2024

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

In this article, you'll use network isolation to secure a managed online endpoint. You'll create a managed online endpoint that uses an Azure Machine Learning workspace's private endpoint for secure inbound communication. You'll also configure the workspace with a managed virtual network that allows only approved outbound communication for deployments. Finally, you'll create a deployment that uses the private endpoints of the workspace's managed virtual network for outbound communication.

For examples that use the legacy method for network isolation, see the deployment files deploy-moe-vnet-legacy.sh (for deployment using a generic model) and deploy-moe-vnet-mlflow-legacy.sh (for deployment using an MLflow model) in the azureml-examples GitHub repo.

Prerequisites

To use Azure Machine Learning, you must have an Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the free or paid version of Azure Machine Learning today.
Install and configure the Azure CLI and the ml extension to the Azure CLI. For more information, see Install, set up, and use the CLI (v2).
Tip

Azure Machine Learning managed virtual network was introduced on May 23rd, 2023. If you have an older version of the ml extension, you might need to update it for the examples in this article to work. To update the extension, use the following Azure CLI command:
```
az extension update -n ml
```
The CLI examples in this article assume that you're using the Bash (or a compatible) shell, for example on a Linux system or in Windows Subsystem for Linux.
You must have an Azure Resource Group, in which you (or the service principal you use) need to have Contributor access. You'll have such a resource group if you've configured your ml extension.
If you want to use a user-assigned managed identity to create and manage online endpoints and online deployments, the identity should have the proper permissions. For details about the required permissions, see Set up service authentication. For example, you need to assign the proper RBAC permission for Azure Key Vault on the identity.

Migrate from legacy network isolation method to workspace managed virtual network

If you've used the legacy method previously for network isolation of managed online endpoints, and you want to migrate to using a workspace managed virtual network to secure your endpoints, you can follow these steps:

Create a new workspace and enable managed virtual network. For more information on how to configure a managed network for your workspace, see Workspace Managed Virtual Network Isolation.
(Optional) In the workspace network settings, add outbound rules of type private endpoint if your deployments need to access private resources other than the Storage account, Azure Key Vault, and Azure Container Registry (ACR) associated with the workspace (which are added by default).
(Optional) If you intend to use Azure Machine Learning registries, configure private endpoints for outbound communication to your registry, its storage account, and its Azure Container Registry.
Create online endpoints and deployments in the new workspace. You can deploy directly from Azure Machine Learning registries. For more information, see Deploy from Registry.
Update applications invoking endpoints to use the scoring URIs of the new online endpoints.
Delete online endpoints from old workspace after validation.

If you don't need to maintain computes or keep online endpoints and deployments in the old workspace to serve without downtime, you can simply delete all computes in the existing workspace, and update the workspace to enable workspace managed virtual network.

Limitations

The v1_legacy_mode flag must be disabled (false) on your Azure Machine Learning workspace. If this flag is enabled, you won't be able to create a managed online endpoint. For more information, see Network isolation with v2 API.
If your Azure Machine Learning workspace has a private endpoint that was created before May 24, 2022, you must recreate the workspace's private endpoint before configuring your online endpoints to use a private endpoint. For more information on creating a private endpoint for your workspace, see How to configure a private endpoint for Azure Machine Learning workspace.

Tip

To confirm when a workspace was created, you can check the workspace properties.

In the Studio, go to the Directory + Subscription + Workspace section (top right of the Studio) and select View all properties in Azure Portal. Select the JSON view from the top right of the "Overview" page, then choose the latest API version. From this page, you can check the value of properties.creationTime.

Alternatively, use az ml workspace show with the CLI, my_ml_client.workspaces.get("my-workspace-name") with the SDK, or curl on a workspace with the REST API.
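Once you have a creation timestamp, the date comparison can be scripted. The following is a minimal sketch rather than official tooling: created_before_cutoff is a hypothetical helper that takes an ISO 8601 timestamp (for example, a value copied from properties.creationTime) and succeeds when the date falls before May 24, 2022. It assumes the timestamp begins with a YYYY-MM-DD date.

```bash
# Hypothetical helper (not an Azure CLI command).
# Succeeds (exit 0) when the ISO 8601 timestamp in $1 is dated before 2022-05-24.
created_before_cutoff() {
  local date_part digits
  date_part="${1%%T*}"                               # keep only YYYY-MM-DD
  digits="$(printf '%s' "$date_part" | tr -d '-')"   # turn it into YYYYMMDD
  [ "$digits" -lt 20220524 ]
}
```

For example, created_before_cutoff "2022-03-01T10:15:00Z" succeeds, while a timestamp dated May 24, 2022 or later makes the function return a nonzero status.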
When you use network isolation with online endpoints, you can use workspace-associated resources (Azure Container Registry (ACR), Storage account, Key Vault, and Application Insights) from a different resource group than that of your workspace. However, these resources must belong to the same subscription and tenant as your workspace.

Note

Network isolation described in this article applies to data plane operations, that is, operations that result from scoring requests (or model serving). Control plane operations (such as requests to create, update, delete, or retrieve authentication keys) are sent to the Azure Resource Manager over the public network.

Prepare your system

Create the environment variables used by this example by running the following commands. Replace <YOUR_WORKSPACE_NAME> with the name to use for your workspace. Replace <YOUR_RESOURCEGROUP_NAME> with the resource group that will contain your workspace.

Tip

Before creating a new workspace, you must create an Azure Resource Group to contain it. For more information, see Manage Azure Resource Groups.
```
export RESOURCEGROUP_NAME="<YOUR_RESOURCEGROUP_NAME>"
export WORKSPACE_NAME="<YOUR_WORKSPACE_NAME>"
```
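Later commands expand these variables directly, so a placeholder left in place tends to surface only as a confusing CLI error. A small guard such as the following sketch can catch that early. require_set is a hypothetical helper name, not part of the Azure CLI; it treats an empty value, or one still wrapped in angle brackets, as not set.

```bash
# Hypothetical guard (not an Azure CLI command).
# Fails if any named variable is empty or still holds a <PLACEHOLDER> value.
require_set() {
  local name value
  for name in "$@"; do
    eval "value=\${$name:-}"      # indirect lookup of the variable called $name
    case "$value" in
      "" | "<"*">")
        echo "error: $name is unset or still a placeholder" >&2
        return 1
        ;;
    esac
  done
}
```

After exporting the variables, you could run require_set RESOURCEGROUP_NAME WORKSPACE_NAME before you continue.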
Create your workspace. The -m allow_only_approved_outbound parameter configures a managed virtual network for the workspace and blocks outbound traffic except to approved destinations.
```
az ml workspace create -g $RESOURCEGROUP_NAME -n $WORKSPACE_NAME -m allow_only_approved_outbound
```
Alternatively, if you'd like to allow the deployment to send outbound traffic to the internet, uncomment the following code and run it instead.
```
# az ml workspace create -g $RESOURCEGROUP_NAME -n $WORKSPACE_NAME -m allow_internet_outbound
```
For more information on how to create a new workspace or to upgrade your existing workspace to use a managed virtual network, see Configure a managed virtual network to allow internet outbound.

When the workspace is configured with a private endpoint, the Azure Container Registry for the workspace must be configured for Premium tier to allow access via the private endpoint. For more information, see Azure Container Registry service tiers. Also, set the workspace's image_build_compute property, because creating a deployment involves building images. For more information, see Configure image builds.

Important

When a managed virtual network is first set up for a workspace, the network isn't provisioned yet. Before you create online deployments, provision the network by following the guidance in Manually provision a managed network. Requests to create online deployments are rejected until the managed network is provisioned.
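As a sketch of that provisioning step, the following function wraps the az ml workspace provision-network command and then prints the workspace's managed_network settings so that you can inspect the result. The function name and argument order are illustrative assumptions. Check the Manually provision a managed network guidance for the options that apply to your workspace.

```bash
# Hypothetical wrapper: provision the managed network, then show its settings.
# $1 = resource group name, $2 = workspace name
provision_managed_network() {
  local rg="$1" ws="$2"
  # Provisioning can take several minutes; stop here if it fails.
  az ml workspace provision-network -g "$rg" -n "$ws" || return 1
  # Print the managed network configuration for review.
  az ml workspace show -g "$rg" -n "$ws" --query managed_network
}
```

With the environment variables from this article, the call would be provision_managed_network "$RESOURCEGROUP_NAME" "$WORKSPACE_NAME".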
Configure the defaults for the CLI so that you can avoid passing in the values for your workspace and resource group multiple times.
```
az configure --defaults workspace=$WORKSPACE_NAME group=$RESOURCEGROUP_NAME
```
Clone the examples repository to get the example files for the endpoint and deployment, then go to the repository's cli directory.
```
git clone --depth 1 https://github.com/Azure/azureml-examples
cd azureml-examples/cli
```

The commands in this tutorial are in the file deploy-managed-online-endpoint-workspacevnet.sh in the cli directory, and the YAML configuration files are in the endpoints/online/managed/sample/ subdirectory.

Create a secured managed online endpoint

To create a secured managed online endpoint, create the endpoint in your workspace and set the endpoint's public_network_access to disabled to control inbound communication. The endpoint will then have to use the workspace's private endpoint for inbound communication.

Because the workspace is configured to have a managed virtual network, any deployments of the endpoint will use the private endpoints of the managed virtual network for outbound communication.

Set the endpoint's name.

export ENDPOINT_NAME="<YOUR_ENDPOINT_NAME>"

Create an endpoint with public_network_access disabled to block inbound traffic.
```
az ml online-endpoint create --name $ENDPOINT_NAME -f endpoints/online/managed/sample/endpoint.yml --set public_network_access=disabled
```
If you disable public network access for the endpoint, the only way to invoke it is through a private endpoint in your virtual network that can access the workspace. For more information, see secure inbound scoring requests and configure a private endpoint for an Azure Machine Learning workspace.

Alternatively, if you'd like to allow the endpoint to receive scoring requests from the internet, uncomment the following code and run it instead.
```
# az ml online-endpoint create --name $ENDPOINT_NAME -f endpoints/online/managed/sample/endpoint.yml
```

Create a deployment in the workspace managed virtual network.

az ml online-deployment create --name blue --endpoint $ENDPOINT_NAME -f endpoints/online/managed/sample/blue-deployment.yml --all-traffic

Get the status of the deployment.

az ml online-endpoint show -n $ENDPOINT_NAME
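If you create the deployment without waiting for it to finish (for example, in a script), you might want to poll its state before you send traffic. The following is a minimal sketch under stated assumptions: wait_for_deployment is a hypothetical helper, the retry count and delay are arbitrary defaults, and treating Failed and Canceled as terminal states is an assumption about the values that provisioning_state can take.

```bash
# Hypothetical helper: poll a deployment until it reaches a terminal state.
# $1 = endpoint name, $2 = deployment name,
# $3 = max attempts (default 40), $4 = seconds between attempts (default 30)
wait_for_deployment() {
  local endpoint="$1" deployment="$2" attempts="${3:-40}" delay="${4:-30}"
  local state i=0
  while [ "$i" -lt "$attempts" ]; do
    state="$(az ml online-deployment show -e "$endpoint" -n "$deployment" \
      --query provisioning_state -o tsv)" || return 2
    case "$state" in
      Succeeded) echo "$state"; return 0 ;;
      Failed | Canceled) echo "deployment state: $state" >&2; return 1 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "timed out waiting for deployment $deployment" >&2
  return 3
}
```

For the deployment in this article, the call would be wait_for_deployment "$ENDPOINT_NAME" blue.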

Test the endpoint with a scoring request, using the CLI.

az ml online-endpoint invoke --name $ENDPOINT_NAME --request-file endpoints/online/model-1/sample-request.json
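Client applications usually call the scoring URI directly instead of using the CLI invoke command. The following sketch shows one way to do that with curl. It assumes key-based authentication, a client inside a virtual network that can reach the workspace's private endpoint, and a hypothetical helper name, score_with_curl.

```bash
# Hypothetical helper: POST a request file to the endpoint's scoring URI.
# $1 = endpoint name, $2 = path to a JSON request file
score_with_curl() {
  local endpoint="$1" request_file="$2" uri key
  # Look up the scoring URI and the primary key for the endpoint.
  uri="$(az ml online-endpoint show -n "$endpoint" --query scoring_uri -o tsv)" || return 1
  key="$(az ml online-endpoint get-credentials -n "$endpoint" --query primaryKey -o tsv)" || return 1
  # Send the scoring request; --fail makes curl return nonzero on HTTP errors.
  curl --silent --show-error --fail \
    --request POST "$uri" \
    --header "Authorization: Bearer $key" \
    --header "Content-Type: application/json" \
    --data @"$request_file"
}
```

With the sample files from the examples repository, the call would be score_with_curl "$ENDPOINT_NAME" endpoints/online/model-1/sample-request.json.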

Get deployment logs.

az ml online-deployment get-logs --name blue --endpoint $ENDPOINT_NAME

Delete the endpoint if you no longer need it.

az ml online-endpoint delete --name $ENDPOINT_NAME --yes --no-wait

Delete all the resources created in this article. Replace <resource-group-name> with the name of the resource group used in this example:
```
az group delete --resource-group <resource-group-name>
```

Troubleshooting

Online endpoint creation fails with a V1LegacyMode == true message

You can configure the Azure Machine Learning workspace for v1_legacy_mode, which disables v2 APIs. Managed online endpoints are a feature of the v2 API platform, and don't work if v1_legacy_mode is enabled for the workspace.

To disable v1_legacy_mode, see Network isolation with v2.

Important

Check with your network security team before you disable v1_legacy_mode, because they might have enabled it for a reason.

Online endpoint creation with key-based authentication fails

Use the following command to list the network rules of the Azure key vault for your workspace. Replace <keyvault-name> with the name of your key vault:

az keyvault network-rule list -n <keyvault-name>

The response for this command is similar to the following JSON code:

{
    "bypass": "AzureServices",
    "defaultAction": "Deny",
    "ipRules": [],
    "virtualNetworkRules": []
}

If the value of bypass isn't AzureServices, use the guidance in Configure key vault network settings to set it to AzureServices.
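The bypass check can also be scripted by querying only that field. This is a sketch rather than an official procedure: keyvault_bypass_ok is a hypothetical helper that reuses the command shown earlier with the CLI's global --query and -o tsv options.

```bash
# Hypothetical helper: check that the key vault's bypass setting is AzureServices.
# $1 = key vault name
keyvault_bypass_ok() {
  local vault="$1" bypass
  bypass="$(az keyvault network-rule list -n "$vault" --query bypass -o tsv)" || return 2
  if [ "$bypass" = "AzureServices" ]; then
    echo "bypass is AzureServices"
  else
    echo "bypass is '$bypass'; expected AzureServices" >&2
    return 1
  fi
}
```

The function returns 0 when the setting matches, 1 when it doesn't, and 2 when the CLI call itself fails.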

Online deployments fail with an image download error

Note

This issue applies when you use the legacy network isolation method for managed online endpoints, in which Azure Machine Learning creates a managed virtual network for each deployment under an endpoint.

Check if the egress-public-network-access flag is disabled for the deployment. If this flag is enabled, and the visibility of the container registry is private, this failure is expected.
Use the following command to check the status of the private endpoint connection. Replace <registry-name> with the name of the Azure container registry for your workspace:
```
az acr private-endpoint-connection list -r <registry-name> --query "[?privateLinkServiceConnectionState.description=='Egress for Microsoft.MachineLearningServices/workspaces/onlineEndpoints'].{Name:name, status:privateLinkServiceConnectionState.status}"
```
In the response, verify that the status field is set to Approved. If not, use the following command to approve it. Replace <private-endpoint-name> with the name returned from the preceding command.
```
az network private-endpoint-connection approve -n <private-endpoint-name>
```

Scoring endpoint can't be resolved

Verify that the client issuing the scoring request is in a virtual network that can access the Azure Machine Learning workspace.
Use the nslookup command on the endpoint hostname to retrieve the IP address information, for example:
```
nslookup endpointname.westcentralus.inference.ml.azure.com
```
The response contains an address that should be in the range provided by the virtual network.
Note
- For a Kubernetes online endpoint, the endpoint hostname should be the CNAME (domain name) that's specified in your Kubernetes cluster.
- If the endpoint is HTTP, the IP address is contained in the endpoint URI, which you can get from the studio UI.
- You can find more ways to get the IP address of the endpoint in Secure Kubernetes online endpoint.
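To check whether the address that nslookup returns falls inside your virtual network's address range, you can compare it against the range in CIDR notation. The following sketch uses only shell arithmetic. ip_to_int and ip_in_cidr are hypothetical helper names, and the code assumes dotted-quad IPv4 addresses without leading zeros and a shell with 64-bit arithmetic.

```bash
# Hypothetical helpers (not Azure CLI commands); IPv4 only.
# Convert a dotted-quad address to a single integer.
ip_to_int() (
  # The body runs in a subshell, so the IFS change doesn't leak to the caller.
  IFS=.
  set -- $1
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Succeeds (exit 0) when address $1 lies inside CIDR range $2.
ip_in_cidr() {
  local ip="$1" network="${2%/*}" prefix="${2#*/}" mask
  # Build a 32-bit mask with the top $prefix bits set.
  mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  [ $(( $(ip_to_int "$ip") & mask )) -eq $(( $(ip_to_int "$network") & mask )) ]
}
```

For example, with made-up values, ip_in_cidr 10.1.0.5 10.1.0.0/24 succeeds, and ip_in_cidr 10.2.0.5 10.1.0.0/24 fails.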
If the nslookup command doesn't resolve the host name, take the following actions:

Managed online endpoints

Use the following command to check whether an A record exists in the private Domain Name Server (DNS) zone for the virtual network.
```
az network private-dns record-set list -z privatelink.api.azureml.ms -o tsv --query "[].name"
```
The results should contain an entry similar to *.<GUID>.inference.<region>.
If no inference entry is returned, delete the private endpoint for the workspace and then recreate it. For more information, see How to configure a private endpoint.
If the workspace with a private endpoint uses a custom DNS server, run the following command to verify that the resolution from custom DNS works correctly.

dig endpointname.westcentralus.inference.ml.azure.com

Kubernetes online endpoints

Check the DNS configuration in the Kubernetes cluster.

Also check if the azureml-fe works as expected, by using the following command:

kubectl exec -it deploy/azureml-fe -- /bin/bash
(Run in azureml-fe pod)

curl -vi -k https://localhost:<port>/api/v1/endpoint/<endpoint-name>/swagger.json
"Swagger not found"

For HTTP, use the following command:

curl http://localhost:<port>/api/v1/endpoint/<endpoint-name>/swagger.json
"Swagger not found"

If the curl HTTPS request fails or times out but HTTP works, check whether the certificate is valid.
If the preceding process fails to resolve to the A record, verify if the resolution works from Azure DNS (168.63.129.16).
```
dig @168.63.129.16 endpointname.westcentralus.inference.ml.azure.com
```
If the preceding command succeeds, troubleshoot the conditional forwarder for private link on custom DNS.

Online deployments can't be scored

Run the following command to see if the deployment was successful:
```
az ml online-deployment show -e <endpointname> -n <deploymentname> --query '{name:name,state:provisioning_state}'
```
If the deployment completed successfully, the value of state is Succeeded.
If the deployment was successful, use the following command to check that traffic is assigned to the deployment. Replace <endpointname> with the name of your endpoint.
```
az ml online-endpoint show -n <endpointname> --query traffic
```
The response from this command should list the percentage of traffic assigned to deployments.

Tip

This step isn't necessary if you use the azureml-model-deployment header in your request to target this deployment.
If the traffic assignments or deployment header are set correctly, use the following command to get the logs for the endpoint. Replace <endpointname> with the name of the endpoint, and <deploymentname> with the deployment.
```
az ml online-deployment get-logs -e <endpointname> -n <deploymentname>
```
Review the logs to see if there's a problem running the scoring code when you submit a request to the deployment.
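To narrow a long log down to likely problems, you can filter it for common failure keywords. The following is a rough sketch: scan_deployment_logs is a hypothetical helper, the keyword list is an assumption about what scoring failures typically print, and the --lines option is used to request more log lines than the default.

```bash
# Hypothetical helper: print log lines that look like errors, with line numbers.
# $1 = endpoint name, $2 = deployment name, $3 = number of log lines (default 200)
scan_deployment_logs() {
  local endpoint="$1" deployment="$2" lines="${3:-200}"
  az ml online-deployment get-logs -e "$endpoint" -n "$deployment" --lines "$lines" \
    | grep -n -i -E 'error|exception|traceback'
}
```

The function returns a nonzero status when no matching lines are found, which doesn't by itself prove that the scoring code is healthy.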
