Perform safe rollout of new deployments for real-time inference

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

In this article, you'll learn how to deploy a new version of a machine learning model in production without causing any disruption. You'll use a blue-green deployment strategy (also known as a safe rollout strategy) to introduce a new version of a web service to production. This strategy will allow you to roll out your new version of the web service to a small subset of users or requests before rolling it out completely.

This article assumes you're using online endpoints, that is, endpoints that are used for online (real-time) inferencing. There are two types of online endpoints: managed online endpoints and Kubernetes online endpoints. For more information on endpoints and the differences between managed online endpoints and Kubernetes online endpoints, see What are Azure Machine Learning endpoints?.

The main example in this article uses managed online endpoints for deployment. To use Kubernetes endpoints instead, see the notes in this document that are inline with the managed online endpoint discussion.

In this article, you'll learn to:

  • Define an online endpoint with a deployment called "blue" to serve version 1 of a model
  • Scale the blue deployment so that it can handle more requests
  • Deploy version 2 of the model (called the "green" deployment) to the endpoint, but send the deployment no live traffic
  • Test the green deployment in isolation
  • Mirror a percentage of live traffic to the green deployment to validate it
  • Send a small percentage of live traffic to the green deployment
  • Send over all live traffic to the green deployment
  • Delete the now-unused v1 blue deployment


Before following the steps in this article, make sure you have the following prerequisites:

  • Azure role-based access controls (Azure RBAC) are used to grant access to operations in Azure Machine Learning. To perform the steps in this article, your user account must be assigned the owner or contributor role for the Azure Machine Learning workspace, or a custom role allowing Microsoft.MachineLearningServices/workspaces/onlineEndpoints/*. For more information, see Manage access to an Azure Machine Learning workspace.

  • (Optional) To deploy locally, you must install Docker Engine on your local computer. We highly recommend this option, so it's easier to debug issues.

Prepare your system

Set environment variables

If you haven't already set the defaults for the Azure CLI, save your default settings. To avoid passing in the values for your subscription, workspace, and resource group multiple times, run this code:

az account set --subscription <subscription id>
az configure --defaults workspace=<Azure Machine Learning workspace name> group=<resource group>

Clone the examples repository

To follow along with this article, first clone the examples repository (azureml-examples). Then, go to the repository's cli/ directory:

git clone --depth 1
cd azureml-examples
cd cli


Use --depth 1 to clone only the latest commit to the repository. This reduces the time to complete the operation.

The commands in this tutorial are in the file in the cli directory, and the YAML configuration files are in the endpoints/online/managed/sample/ subdirectory.


The YAML configuration files for Kubernetes online endpoints are in the endpoints/online/kubernetes/ subdirectory.

Define the endpoint and deployment

Online endpoints are used for online (real-time) inferencing. Online endpoints contain deployments that are ready to receive data from clients and send responses back in real time.

Define an endpoint

The following table lists key attributes to specify when you define an endpoint.

Attribute | Description
Name | Required. Name of the endpoint. It must be unique in the Azure region. For more information on the naming rules, see endpoint limits.
Authentication mode | The authentication method for the endpoint. Choose between key-based authentication (key) and Azure Machine Learning token-based authentication (aml_token). A key doesn't expire, but a token does expire. For more information on authenticating, see Authenticate to an online endpoint.
Description | Description of the endpoint.
Tags | Dictionary of tags for the endpoint.
Traffic | Rules on how to route traffic across deployments. Represent the traffic as a dictionary of key-value pairs, where the key is the deployment name and the value is the percentage of traffic to that deployment. You can set the traffic only after the deployments under an endpoint have been created, and you can update it later. For more information on how to allocate live traffic, see Allocate a small percentage of live traffic to the new deployment.
Mirror traffic | Percentage of live traffic to mirror to a deployment. For more information on how to use mirrored traffic, see Test the deployment with mirrored traffic.

To see a full list of attributes that you can specify when you create an endpoint, see CLI (v2) online endpoint YAML schema or SDK (v2) ManagedOnlineEndpoint Class.

Define a deployment

A deployment is a set of resources required for hosting the model that does the actual inferencing. The following table describes key attributes to specify when you define a deployment.

Attribute | Description
Name | Required. Name of the deployment.
Endpoint name | Required. Name of the endpoint to create the deployment under.
Model | The model to use for the deployment. This value can be either a reference to an existing versioned model in the workspace or an inline model specification. In the example, we have a scikit-learn model that does regression.
Code path | The path to the directory on the local development environment that contains all the Python source code for scoring the model. You can use nested directories and packages.
Scoring script | Python code that executes the model on a given input request. This value can be the relative path to the scoring file in the source code directory. The scoring script receives data submitted to a deployed web service and passes it to the model. The script then executes the model and returns its response to the client. The scoring script is specific to your model and must understand the data that the model expects as input and returns as output. In this example, we have a file. This Python code must have an init() function and a run() function. The init() function will be called after the model is created or updated (you can use it to cache the model in memory, for example). The run() function is called at every invocation of the endpoint to do the actual scoring and prediction.
Environment | Required. The environment to host the model and code. This value can be either a reference to an existing versioned environment in the workspace or an inline environment specification. The environment can be a Docker image with Conda dependencies, a Dockerfile, or a registered environment.
Instance type | Required. The VM size to use for the deployment. For the list of supported sizes, see Managed online endpoints SKU list.
Instance count | Required. The number of instances to use for the deployment. Base the value on the workload you expect. For high availability, we recommend that you set the value to at least 3. We reserve an extra 20% for performing upgrades. For more information, see limits for online endpoints.

To see a full list of attributes that you can specify when you create a deployment, see CLI (v2) managed online deployment YAML schema or SDK (v2) ManagedOnlineDeployment Class.
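
The init() and run() contract described for the scoring script can be sketched as follows. This is a minimal illustration, not the scoring file from the examples repository: the _MeanModel stand-in and the {"data": [...]} request shape are assumptions chosen so that the sketch runs without a trained model.

```python
import json


class _MeanModel:
    """Hypothetical stand-in for a trained model, used only to keep the sketch runnable."""

    def predict(self, rows):
        # Return the mean of each input row as a placeholder "prediction".
        return [sum(row) / len(row) for row in rows]


model = None  # populated once by init()


def init():
    """Called once after the deployment is created or updated; caches the model in memory."""
    global model
    model = _MeanModel()  # a real script would load the registered model from disk here


def run(raw_data):
    """Called on every invocation; parses the JSON request and returns predictions."""
    rows = json.loads(raw_data)["data"]  # assumed request shape: {"data": [[...], ...]}
    return model.predict(rows)


init()
print(run('{"data": [[1, 2, 3], [4, 5, 6]]}'))
```

Loading the model in init() rather than in run() means the cost is paid once per instance instead of once per request.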

Create online endpoint

First set the endpoint's name and then configure it. In this article, you'll use the endpoints/online/managed/sample/endpoint.yml file to configure the endpoint. The following snippet shows the contents of the file:

name: my-endpoint
auth_mode: key

The reference for the endpoint YAML format is described in the following table. To learn how to specify these attributes, see the online endpoint YAML reference. For information about limits related to managed online endpoints, see limits for online endpoints.

Key | Description
$schema | (Optional) The YAML schema. To see all available options in the YAML file, you can view the schema in the preceding code snippet in a browser.
name | The name of the endpoint.
auth_mode | Use key for key-based authentication. Use aml_token for Azure Machine Learning token-based authentication. To get the most recent token, use the az ml online-endpoint get-credentials command.

To create an online endpoint:

  1. Set your endpoint name:

    For Unix, run this command (replace YOUR_ENDPOINT_NAME with a unique name):

    export ENDPOINT_NAME="<YOUR_ENDPOINT_NAME>"

    Endpoint names must be unique within an Azure region. For example, in the Azure westus2 region, there can be only one endpoint with the name my-endpoint.

  2. Create the endpoint in the cloud:

    Run the following code to use the endpoint.yml file to configure the endpoint:

    az ml online-endpoint create --name $ENDPOINT_NAME -f endpoints/online/managed/sample/endpoint.yml

Create the 'blue' deployment

In this article, you'll use the endpoints/online/managed/sample/blue-deployment.yml file to configure the key aspects of the deployment. The following snippet shows the contents of the file:

name: blue
endpoint_name: my-endpoint
model:
  path: ../../model-1/model/
code_configuration:
  code: ../../model-1/onlinescoring/
environment:
  conda_file: ../../model-1/environment/conda.yaml
instance_type: Standard_DS3_v2
instance_count: 1

To create a deployment named blue for your endpoint, run the following command, which uses the blue-deployment.yml file to configure the deployment:

az ml online-deployment create --name blue --endpoint-name $ENDPOINT_NAME -f endpoints/online/managed/sample/blue-deployment.yml --all-traffic


The --all-traffic flag in the az ml online-deployment create command allocates 100% of the endpoint traffic to the newly created blue deployment.

In the blue-deployment.yml file, we specify the path (where to upload files from) inline. The CLI automatically uploads the files and registers the model and environment. As a best practice for production, you should register the model and environment separately and specify the registered name and version in the YAML. Use the form model: azureml:my-model:1 or environment: azureml:my-env:1.

For registration, you can extract the YAML definitions of model and environment into separate YAML files and use the commands az ml model create and az ml environment create. To learn more about these commands, run az ml model create -h and az ml environment create -h.

For more information on registering your model as an asset, see Register your model as an asset in Machine Learning by using the CLI. For more information on creating an environment, see Manage Azure Machine Learning environments with the CLI & SDK (v2).
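
As a rough illustration of the azureml:<name>:<version> form mentioned above, the following sketch splits such a reference into its name and version. The parse_asset_reference helper is hypothetical and is not part of the CLI or SDK; it only shows that the reference pins both a name and a version.

```python
def parse_asset_reference(ref):
    """Split 'azureml:<name>:<version>' into a (name, version) tuple."""
    prefix, sep, rest = ref.partition(":")
    if prefix != "azureml" or not sep:
        raise ValueError(f"expected 'azureml:<name>:<version>', got {ref!r}")
    # The version is whatever follows the last colon.
    name, sep, version = rest.rpartition(":")
    if not sep or not name or not version:
        raise ValueError(f"reference {ref!r} does not pin a version")
    return name, version


print(parse_asset_reference("azureml:my-model:1"))
print(parse_asset_reference("azureml:my-env:1"))
```

A check like this could run before a deployment to catch references that omit the version.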

Confirm your existing deployment

One way to confirm your existing deployment is to invoke your endpoint so that it can score your model for a given input request. When you invoke your endpoint via the CLI or Python SDK, you can choose to specify the name of the deployment that will receive the incoming traffic.


Unlike the CLI or Python SDK, Azure Machine Learning studio requires you to specify a deployment when you invoke an endpoint.

Invoke endpoint with deployment name

If you invoke the endpoint with the name of the deployment that will receive traffic, Azure Machine Learning will route the endpoint's traffic directly to the specified deployment and return its output. You can use the --deployment-name option for CLI v2, or deployment_name option for SDK v2 to specify the deployment.

Invoke endpoint without specifying deployment

If you invoke the endpoint without specifying the deployment that will receive traffic, Azure Machine Learning will route the endpoint's incoming traffic to the deployment(s) in the endpoint based on traffic control settings.

Traffic control settings allocate specified percentages of incoming traffic to each deployment in the endpoint. For example, if your traffic rules specify that a particular deployment in your endpoint will receive incoming traffic 40% of the time, Azure Machine Learning will route 40% of the endpoint's traffic to that deployment.
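
To make the percentage semantics concrete, here is a small simulation of weighted routing. It is only an illustration of what "40% of the time" means statistically and doesn't reflect how Azure Machine Learning implements routing internally; the route function and the 60/40 split are assumptions made for the example.

```python
import random
from collections import Counter


def route(traffic, rng):
    """Pick one deployment name with probability proportional to its traffic percentage."""
    names = list(traffic)
    weights = [traffic[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so the run is repeatable
traffic = {"blue": 60, "green": 40}  # hypothetical traffic rules
total = 10_000

counts = Counter(route(traffic, rng) for _ in range(total))
share = counts["green"] / total
print(f"green received {share:.1%} of {total} simulated requests")
```

Over many requests the observed share approaches the configured percentage, but any small batch of requests can deviate from it.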

You can view the status of your existing endpoint and deployment by running:

az ml online-endpoint show --name $ENDPOINT_NAME 

az ml online-deployment show --name blue --endpoint $ENDPOINT_NAME 

You should see the endpoint identified by $ENDPOINT_NAME and a deployment called blue.

Test the endpoint with sample data

You can invoke the endpoint by using the invoke command. We'll send a sample request from a JSON file.

az ml online-endpoint invoke --name $ENDPOINT_NAME --request-file endpoints/online/model-1/sample-request.json

Scale your existing deployment to handle more traffic

In the deployment described in Deploy and score a machine learning model with an online endpoint, you set instance_count to 1 in the deployment YAML file. You can scale out by using the update command:

az ml online-deployment update --name blue --endpoint-name $ENDPOINT_NAME --set instance_count=2


Notice that the preceding command uses --set to override the deployment configuration. Alternatively, you can update the YAML file and pass it to the update command by using the --file option.

Deploy a new model, but send it no traffic yet

Create a new deployment named green:

az ml online-deployment create --name green --endpoint-name $ENDPOINT_NAME -f endpoints/online/managed/sample/green-deployment.yml

Since we haven't explicitly allocated any traffic to green, it has zero traffic allocated to it. You can verify that using the command:

az ml online-endpoint show -n $ENDPOINT_NAME --query traffic

Test the new deployment

Though green has 0% of traffic allocated, you can invoke it directly by specifying the --deployment-name option:

az ml online-endpoint invoke --name $ENDPOINT_NAME --deployment-name green --request-file endpoints/online/model-2/sample-request.json

If you want to use a REST client to invoke the deployment directly without going through traffic rules, set the following HTTP header: azureml-model-deployment: <deployment-name>. The following code snippet uses curl to invoke the deployment directly. It should work in Unix/WSL environments:

# get the scoring uri
SCORING_URI=$(az ml online-endpoint show -n $ENDPOINT_NAME -o tsv --query scoring_uri)
# get the endpoint key used in the Authorization header
ENDPOINT_KEY=$(az ml online-endpoint get-credentials -n $ENDPOINT_NAME -o tsv --query primaryKey)
# use curl to invoke the endpoint
curl --request POST "$SCORING_URI" --header "Authorization: Bearer $ENDPOINT_KEY" --header 'Content-Type: application/json' --header "azureml-model-deployment: green" --data @endpoints/online/model-2/sample-request.json
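
The same request can be assembled from Python with only the standard library. The sketch below builds the request without sending it; the build_scoring_request helper, the placeholder URL and key, and the payload shape are assumptions for illustration. In practice you would substitute the scoring URI and key of your own endpoint.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, key, deployment, payload):
    """Build a POST request that targets one deployment, bypassing traffic rules."""
    headers = {
        "Authorization": f"Bearer {key}",
        "Content-Type": "application/json",
        # This header routes the request directly to the named deployment.
        "azureml-model-deployment": deployment,
    }
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(scoring_uri, data=body, headers=headers, method="POST")


req = build_scoring_request(
    "https://example.invalid/score",  # placeholder, not a real scoring URI
    "<endpoint-key>",                 # placeholder key
    "green",
    {"data": [[1, 2, 3]]},            # assumed payload shape
)
print(req.get_method(), req.full_url)
# urllib normalizes header capitalization; HTTP header names are case-insensitive.
print(dict(req.header_items()))
# To send the request: urllib.request.urlopen(req)
```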

Test the deployment with mirrored traffic

Once you've tested your green deployment, you can mirror (or copy) a percentage of the live traffic to it. Traffic mirroring (also called shadowing) doesn't change the results returned to clients: requests still flow 100% to the blue deployment. The mirrored percentage of the traffic is copied and submitted to the green deployment so that you can gather metrics and logs without impacting your clients. For example, you can use mirroring to check whether latency is within acceptable bounds or to check that there are no HTTP errors. Testing a new deployment with mirrored traffic is also known as shadow testing, and the deployment receiving the mirrored traffic (in this case, the green deployment) can also be called the shadow deployment.

Mirroring has the following limitations:

  • Mirroring is supported for the CLI (v2) (version 2.4.0 or above) and Python SDK (v2) (version 1.0.0 or above). If you use an older version of CLI/SDK to update an endpoint, you'll lose the mirror traffic setting.
  • Mirroring isn't currently supported for Kubernetes online endpoints.
  • You can mirror traffic to only one deployment in an endpoint.
  • The maximum percentage of traffic you can mirror is 50%. This limit is to reduce the effect on your endpoint bandwidth quota (default 5 MBPS)—your endpoint bandwidth is throttled if you exceed the allocated quota. For information on monitoring bandwidth throttling, see Monitor managed online endpoints.

Also note the following behaviors:

  • A deployment can be configured to receive only live traffic or mirrored traffic, not both.
  • When you invoke an endpoint, you can specify the name of any of its deployments — even a shadow deployment — to return the prediction.
  • When you invoke an endpoint with the name of the deployment that will receive incoming traffic, Azure Machine Learning won't mirror traffic to the shadow deployment. Azure Machine Learning mirrors traffic to the shadow deployment from traffic sent to the endpoint when you don't specify a deployment.
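
The limitations and behaviors listed above can be expressed as a simple pre-flight check. The following sketch is a hypothetical helper, not part of the CLI or SDK; it only encodes the three rules stated in this section (one mirrored deployment, at most 50%, and no deployment receiving both live and mirrored traffic).

```python
MAX_MIRROR_PERCENT = 50  # maximum mirror percentage stated in the limitations above


def check_mirror_traffic(traffic, mirror_traffic):
    """Return a list of problems with a live/mirror traffic configuration (empty if none)."""
    problems = []
    # Only deployments with a nonzero mirror percentage count as mirrored.
    mirrored = {name: pct for name, pct in mirror_traffic.items() if pct > 0}
    if len(mirrored) > 1:
        problems.append("traffic can be mirrored to only one deployment")
    for name, pct in mirrored.items():
        if pct > MAX_MIRROR_PERCENT:
            problems.append(f"{name}: mirror percentage {pct} exceeds {MAX_MIRROR_PERCENT}")
        if traffic.get(name, 0) > 0:
            problems.append(f"{name}: cannot receive both live and mirrored traffic")
    return problems


print(check_mirror_traffic({"blue": 100, "green": 0}, {"green": 10}))
print(check_mirror_traffic({"blue": 90, "green": 10}, {"green": 60}))
```

The first call describes the configuration used in this article and reports no problems; the second violates two of the rules.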

Now, let's set the green deployment to receive 10% of mirrored traffic. Clients will still receive predictions from the blue deployment only.

Diagram showing 10% traffic mirrored to one deployment.

The following command mirrors 10% of the traffic to the green deployment:

az ml online-endpoint update --name $ENDPOINT_NAME --mirror-traffic "green=10"

You can test mirror traffic by invoking the endpoint several times without specifying a deployment to receive the incoming traffic:

for i in {1..20} ; do
    az ml online-endpoint invoke --name $ENDPOINT_NAME --request-file endpoints/online/model-1/sample-request.json
done

You can confirm that the specified percentage of the traffic was sent to the green deployment by viewing the logs from that deployment:

az ml online-deployment get-logs --name green --endpoint $ENDPOINT_NAME
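
When you read the logs, keep in mind that 20 invocations is a small sample. The arithmetic below is a back-of-the-envelope estimate under the assumption that each request is mirrored independently with a 10% probability. The service's actual sampling behavior isn't described in this article, so treat the numbers as a rough guide only.

```python
mirror_fraction = 0.10  # matches --mirror-traffic "green=10"
requests_sent = 20      # matches the loop above

# Expected number of requests copied to the green deployment.
expected_copies = mirror_fraction * requests_sent
# Probability that none of the requests is copied, assuming independence.
p_no_copies = (1 - mirror_fraction) ** requests_sent

print(f"expected mirrored requests: {expected_copies:.1f}")
print(f"chance that none are mirrored: {p_no_copies:.1%}")
```

Under that assumption, roughly one run in eight would show no mirrored requests at all, so an empty log after a single short test isn't necessarily a sign of misconfiguration.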

After testing, you can set the mirror traffic to zero to disable mirroring:

az ml online-endpoint update --name $ENDPOINT_NAME --mirror-traffic "green=0"

Allocate a small percentage of live traffic to the new deployment

Once you've tested your green deployment, allocate a small percentage of traffic to it:

az ml online-endpoint update --name $ENDPOINT_NAME --traffic "blue=90 green=10"


The total traffic percentage must sum to either 0% (to disable traffic) or 100% (to enable traffic).
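
If you generate the --traffic value programmatically, you can check the sum rule before calling the CLI. The format_traffic helper below is a hypothetical sketch, not part of the CLI or SDK. It validates a traffic dictionary and renders it in the "name=percent" form used by the commands in this article.

```python
def format_traffic(traffic):
    """Validate a traffic allocation and render it as a --traffic argument value."""
    if any(pct < 0 or pct > 100 for pct in traffic.values()):
        raise ValueError("each percentage must be between 0 and 100")
    total = sum(traffic.values())
    if total not in (0, 100):
        raise ValueError(f"percentages sum to {total}; expected 0 or 100")
    return " ".join(f"{name}={pct}" for name, pct in traffic.items())


print(format_traffic({"blue": 90, "green": 10}))
```

An allocation such as {"blue": 90, "green": 20} raises a ValueError instead of producing an argument that the service would reject.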

Now, your green deployment receives 10% of all live traffic. Clients will receive predictions from both the blue and green deployments.

Diagram showing traffic split between deployments.

Send all traffic to your new deployment

Once you're fully satisfied with your green deployment, switch all traffic to it.

az ml online-endpoint update --name $ENDPOINT_NAME --traffic "blue=0 green=100"
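
The two traffic updates in this article (10%, then 100%) are the simplest form of a staged rollout. As a sketch of how the stages could be generated, the hypothetical rollout_steps function below yields one traffic allocation per stage. The stage percentages are an input you choose; nothing here is prescribed by Azure Machine Learning.

```python
def rollout_steps(old, new, percentages):
    """Yield traffic allocations that shift traffic from `old` to `new`, stage by stage."""
    for pct in percentages:
        if not 0 <= pct <= 100:
            raise ValueError(f"invalid percentage: {pct}")
        # Whatever isn't sent to the new deployment stays on the old one.
        yield {old: 100 - pct, new: pct}


for step in rollout_steps("blue", "green", [10, 100]):
    # Render each stage in the form expected by --traffic.
    print(" ".join(f"{name}={pct}" for name, pct in step.items()))
```

Adding intermediate stages, for example [10, 50, 100], gives you more checkpoints at which to inspect metrics before committing all traffic.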

Remove the old deployment

Use the following steps to delete an individual deployment from a managed online endpoint. Deleting an individual deployment does affect the other deployments in the managed online endpoint:

az ml online-deployment delete --name blue --endpoint $ENDPOINT_NAME --yes --no-wait

Delete the endpoint and deployment

If you aren't going to use the endpoint and deployment, you should delete them. By deleting the endpoint, you'll also delete all its underlying deployments.

az ml online-endpoint delete --name $ENDPOINT_NAME --yes --no-wait