Safe rollout for online endpoints

APPLIES TO: Azure CLI ml extension v2 (current)

APPLIES TO: Python SDK azure-ai-ml v2 (current)

In this article, you'll learn how to deploy a new version of a machine learning model in production without causing any disruption. You'll use blue-green deployment, also known as a safe rollout strategy, to introduce a new version of a web service to production. This strategy will allow you to roll out your new version of the web service to a small subset of users or requests before rolling it out completely.

This article assumes you're using online endpoints, that is, endpoints that are used for online (real-time) inferencing. There are two types of online endpoints: managed online endpoints and Kubernetes online endpoints. For more information on endpoints and the differences between managed online endpoints and Kubernetes online endpoints, see What are Azure Machine Learning endpoints?.


The main example in this article uses managed online endpoints for deployment. To use Kubernetes endpoints instead, see the notes in this document inline with the managed online endpoints discussion.

In this article, you'll learn to:

  • Define an online endpoint and a deployment called "blue" to serve version 1 of a model
  • Scale the blue deployment so that it can handle more requests
  • Deploy version 2 of the model (called the "green" deployment) to the endpoint, but send the deployment no live traffic
  • Test the green deployment in isolation
  • Mirror a percentage of live traffic to the green deployment to validate it (preview)
  • Send a small percentage of live traffic to the green deployment
  • Send over all live traffic to the green deployment
  • Delete the now-unused v1 blue deployment


Before following the steps in this article, make sure you have the following prerequisites:

  • Azure role-based access controls (Azure RBAC) are used to grant access to operations in Azure Machine Learning. To perform the steps in this article, your user account must be assigned the owner or contributor role for the Azure Machine Learning workspace, or a custom role allowing Microsoft.MachineLearningServices/workspaces/onlineEndpoints/*. For more information, see Manage access to an Azure Machine Learning workspace.

  • If you haven't already set the defaults for the Azure CLI, save your default settings. To avoid passing in the values for your subscription, workspace, and resource group multiple times, run this code:

    az account set --subscription <subscription id>
    az configure --defaults workspace=<azureml workspace name> group=<resource group>
  • (Optional) To deploy locally, you must install Docker Engine on your local computer. We highly recommend this option, so it's easier to debug issues.

Prepare your system

Clone the examples repository

To follow along with this article, first clone the examples repository (azureml-examples). Then, go to the repository's cli/ directory:

git clone --depth 1
cd azureml-examples
cd cli


Use --depth 1 to clone only the latest commit to the repository. This reduces the time to complete the operation.

The commands in this tutorial are in the file in the cli directory, and the YAML configuration files are in the endpoints/online/managed/sample/ subdirectory.


The YAML configuration files for Kubernetes online endpoints are in the endpoints/online/kubernetes/ subdirectory.

Define the endpoint and deployment

Online endpoints are used for online (real-time) inferencing. Online endpoints contain deployments that are ready to receive data from clients and can send responses back in real time.

Create online endpoint

To create an online endpoint:

  1. Set your endpoint name:

    For Unix, run this command (replace YOUR_ENDPOINT_NAME with a unique name):



    Endpoint names must be unique within an Azure region. For example, in the Azure westus2 region, there can be only one endpoint with the name my-endpoint.

  2. Create the endpoint in the cloud, run the following code:

    az ml online-endpoint create --name $ENDPOINT_NAME -f endpoints/online/managed/sample/endpoint.yml

Create the 'blue' deployment

A deployment is a set of resources required for hosting the model that does the actual inferencing. To create a deployment named blue for your endpoint, run the following command:

az ml online-deployment create --name blue --endpoint-name $ENDPOINT_NAME -f endpoints/online/managed/sample/blue-deployment.yml --all-traffic

Confirm your existing deployment

You can view the status of your existing endpoint and deployment by running:

az ml online-endpoint show --name $ENDPOINT_NAME 

az ml online-deployment show --name blue --endpoint $ENDPOINT_NAME 

You should see the endpoint identified by $ENDPOINT_NAME and, a deployment called blue.

Test the endpoint with sample data

The endpoint can be invoked using the invoke command. We'll send a sample request using a json file.

az ml online-endpoint invoke --name $ENDPOINT_NAME --request-file endpoints/online/model-1/sample-request.json

Scale your existing deployment to handle more traffic

In the deployment described in Deploy and score a machine learning model with an online endpoint, you set the instance_count to the value 1 in the deployment yaml file. You can scale out using the update command:

az ml online-deployment update --name blue --endpoint-name $ENDPOINT_NAME --set instance_count=2


Notice that in the above command we use --set to override the deployment configuration. Alternatively you can update the yaml file and pass it as an input to the update command using the --file input.

Deploy a new model, but send it no traffic yet

Create a new deployment named green:

az ml online-deployment create --name green --endpoint-name $ENDPOINT_NAME -f endpoints/online/managed/sample/green-deployment.yml

Since we haven't explicitly allocated any traffic to green, it will have zero traffic allocated to it. You can verify that using the command:

az ml online-endpoint show -n $ENDPOINT_NAME --query traffic

Test the new deployment

Though green has 0% of traffic allocated, you can invoke it directly by specifying the --deployment name:

az ml online-endpoint invoke --name $ENDPOINT_NAME --deployment-name green --request-file endpoints/online/model-2/sample-request.json

If you want to use a REST client to invoke the deployment directly without going through traffic rules, set the following HTTP header: azureml-model-deployment: <deployment-name>. The below code snippet uses curl to invoke the deployment directly. The code snippet should work in Unix/WSL environments:

# get the scoring uri
SCORING_URI=$(az ml online-endpoint show -n $ENDPOINT_NAME -o tsv --query scoring_uri)
# use curl to invoke the endpoint
curl --request POST "$SCORING_URI" --header "Authorization: Bearer $ENDPOINT_KEY" --header 'Content-Type: application/json' --header "azureml-model-deployment: green" --data @endpoints/online/model-2/sample-request.json

Test the deployment with mirrored traffic (preview)


This feature is currently in public preview. This preview version is provided without a service-level agreement, and it's not recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

Once you've tested your green deployment, you can 'mirror' (or copy) a percentage of the live traffic to it. Mirroring traffic (also called shadowing) doesn't change the results returned to clients. Requests still flow 100% to the blue deployment. The mirrored percentage of the traffic is copied and submitted to the green deployment so you can gather metrics and logging without impacting your clients. Mirroring is useful when you want to validate a new deployment without impacting clients; for example, to check if latency is within acceptable bounds and that there are no HTTP errors. Testing the new deployment with traffic mirroring/shadowing is also known as shadow testing. The deployment receiving the mirrored traffic (in this case, the green deployment) can also be called the shadow deployment.


Mirroring traffic uses your endpoint bandwidth quota (default 5 MBPS). Your endpoint bandwidth will be throttled if you exceed the allocated quota. For information on monitoring bandwidth throttling, see Monitor managed online endpoints.


Mirrored traffic is supported for the CLI (v2) (version 2.4.0 or above) and Python SDK (v2) (version 1.0.0 or above). If you update the endpoint using an older version of CLI/SDK or Studio UI, the setting for mirrored traffic will be removed.

The following command mirrors 10% of the traffic to the green deployment:

az ml online-endpoint update --name $ENDPOINT_NAME --mirror-traffic "green=10"

You can test mirror traffic by invoking the endpoint several times:

for i in {1..20} ; do
    az ml online-endpoint invoke --name $ENDPOINT_NAME --request-file endpoints/online/model-1/sample-request.json

Mirroring has the following limitations:

  • You can only mirror traffic to one deployment.
  • Mirror traffic isn't currently supported for Kubernetes online endpoints.
  • The maximum mirrored traffic you can configure is 50%. This limit is to reduce the impact on your endpoint bandwidth quota.

Also note the following behavior:

  • A deployment can only be set to live or mirror traffic, not both.
  • You can send traffic directly to the mirror deployment by specifying the deployment set for mirror traffic.
  • You can send traffic directly to a live deployment by specifying the deployment set for live traffic, but in this case the traffic won't be mirrored to the mirror deployment. Mirror traffic is routed from traffic sent to endpoint without specifying the deployment.


You can use --deployment-name option for CLI v2, or deployment_name option for SDK v2 to specify the deployment to be routed to.

Diagram showing 10% traffic mirrored to one deployment.

You can confirm that the specific percentage of the traffic was sent to the green deployment by seeing the logs from the deployment:

az ml online-deployment get-logs --name blue --endpoint $ENDPOINT_NAME

After testing, you can set the mirror traffic to zero to disable mirroring:

az ml online-endpoint update --name $ENDPOINT_NAME --mirror-traffic "green=0"

Test the new deployment with a small percentage of live traffic

Once you've tested your green deployment, allocate a small percentage of traffic to it:

az ml online-endpoint update --name $ENDPOINT_NAME --traffic "blue=90 green=10"

Now, your green deployment will receive 10% of requests.

Diagram showing traffic split between deployments.

Send all traffic to your new deployment

Once you're fully satisfied with your green deployment, switch all traffic to it.

az ml online-endpoint update --name $ENDPOINT_NAME --traffic "blue=0 green=100"

Remove the old deployment

az ml online-deployment delete --name blue --endpoint $ENDPOINT_NAME --yes --no-wait

Delete the endpoint and deployment

If you aren't going use the deployment, you should delete it with:

az ml online-endpoint delete --name $ENDPOINT_NAME --yes --no-wait

Next steps