Deploy and score a machine learning model by using an online endpoint

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

In this article, you learn to deploy your model to an online endpoint for use in real-time inferencing. You begin by deploying a model on your local machine to debug any errors. Then, you deploy and test the model in Azure, view the deployment logs, and monitor the service-level agreement (SLA). By the end of this article, you'll have a scalable HTTPS/REST endpoint that you can use for real-time inference.

Online endpoints are endpoints that are used for real-time inferencing. There are two types of online endpoints: managed online endpoints and Kubernetes online endpoints. For more information on endpoints and differences between managed online endpoints and Kubernetes online endpoints, see What are Azure Machine Learning endpoints?

Managed online endpoints help you deploy your machine learning models in a turnkey manner. They work with powerful CPU and GPU machines in Azure in a scalable, fully managed way, and they take care of serving, scaling, securing, and monitoring your models, freeing you from the overhead of setting up and managing the underlying infrastructure.

The main example in this article uses managed online endpoints for deployment. To use Kubernetes online endpoints instead, see the notes that appear inline with the managed online endpoint discussion.

Prerequisites

APPLIES TO: Python SDK azure-ai-ml v2 (current)

  • An Azure Machine Learning workspace. For steps for creating a workspace, see Create the workspace.

  • The Azure Machine Learning SDK for Python v2. To install the SDK, use the following command:

    pip install azure-ai-ml azure-identity
    

    To update an existing installation of the SDK to the latest version, use the following command:

    pip install --upgrade azure-ai-ml azure-identity
    

    For more information, see Azure Machine Learning Package client library for Python.

  • Azure role-based access control (Azure RBAC) is used to grant access to operations in Azure Machine Learning. To perform the steps in this article, your user account must be assigned the owner or contributor role for the Azure Machine Learning workspace, or a custom role allowing Microsoft.MachineLearningServices/workspaces/onlineEndpoints/*. For more information, see Manage access to an Azure Machine Learning workspace.

  • (Optional) To deploy locally, you must install Docker Engine on your local computer. We highly recommend this option, so it's easier to debug issues.

  • Ensure that you have enough virtual machine (VM) quota allocated for deployment. Azure Machine Learning reserves 20% of your compute resources for performing upgrades on some VM SKUs. For example, if you request 10 instances in a deployment, you must have quota for 12 instances' worth of cores for the VM SKU. Failure to account for the extra compute resources results in an error. Some VM SKUs are exempt from the extra quota reservation. For more information on quota allocation, see virtual machine quota allocation for deployment. For a quick worked example of this calculation, see the sketch after this list.

  • Alternatively, you could use quota from Azure Machine Learning's shared quota pool for a limited time. Azure Machine Learning provides a shared quota pool from which users across various regions can access quota to perform testing for a limited time, depending upon availability. When you use the studio to deploy Llama-2, Phi, Nemotron, Mistral, Dolly, and Deci-DeciLM models from the model catalog to a managed online endpoint, Azure Machine Learning allows you to access its shared quota pool for a short time so that you can perform testing. For more information on the shared quota pool, see Azure Machine Learning shared quota.
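For a quick worked example of the quota calculation described above, here's a small sketch with illustrative numbers (the instance and core counts are hypothetical):

import math

# Hypothetical request: 10 instances of a 4-core VM SKU (for example, Standard_DS3_v2).
requested_instances = 10
cores_per_instance = 4

# Azure Machine Learning reserves an extra 20% on some SKUs for upgrades,
# so the quota must cover 12 instances' worth of cores in this case.
required_instances = math.ceil(requested_instances * 1.2)
required_cores = required_instances * cores_per_instance
print(required_instances, required_cores)  # 12 instances -> 48 cores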

Prepare your system

Clone the examples repository

To run the examples, first clone the examples repository (azureml-examples) and then change into the azureml-examples/sdk/python/endpoints/online/managed directory:

git clone --depth 1 https://github.com/Azure/azureml-examples
cd azureml-examples/sdk/python/endpoints/online/managed

Tip

Use --depth 1 to clone only the latest commit to the repository, which reduces time to complete the operation.

The information in this article is based on the online-endpoints-simple-deployment.ipynb notebook. It contains the same content as this article, although the order of the code is slightly different.

Connect to Azure Machine Learning workspace

The workspace is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section, you connect to the workspace in which you'll perform deployment tasks. To follow along, open your online-endpoints-simple-deployment.ipynb notebook.

  1. Import the required libraries:

    # import required libraries
    from azure.ai.ml import MLClient
    from azure.ai.ml.entities import (
        ManagedOnlineEndpoint,
        ManagedOnlineDeployment,
        Model,
        Environment,
        CodeConfiguration
    )
    from azure.identity import DefaultAzureCredential
    

    Note

    If you're using the Kubernetes online endpoint, import the KubernetesOnlineEndpoint and KubernetesOnlineDeployment classes from the azure.ai.ml.entities library.

  2. Configure workspace details and get a handle to the workspace:

    To connect to a workspace, you need identifier parameters: a subscription ID, resource group, and workspace name. You use these details in the MLClient from azure.ai.ml to get a handle to the required Azure Machine Learning workspace. This example uses the default Azure authentication.

    # enter details of your Azure Machine Learning workspace
    subscription_id = "<SUBSCRIPTION_ID>"
    resource_group = "<RESOURCE_GROUP>"
    workspace = "<AZUREML_WORKSPACE_NAME>"
    
    # get a handle to the workspace
    ml_client = MLClient(
        DefaultAzureCredential(), subscription_id, resource_group, workspace
    )
    

Define the endpoint

To define an online endpoint, specify the endpoint name and authentication mode. For more information on managed online endpoints, see Online endpoints.

Configure an endpoint

First define the name of the online endpoint, then configure the endpoint.

Your endpoint name must be unique in the Azure region. For more information on the naming rules, see endpoint limits.

# Define an endpoint name
endpoint_name = "my-endpoint"

# Example way to define a random name
import datetime

endpoint_name = "endpt-" + datetime.datetime.now().strftime("%m%d%H%M%f")

# create an online endpoint
endpoint = ManagedOnlineEndpoint(
    name=endpoint_name,
    description="this is a sample endpoint",
    auth_mode="key"
)

The previous code uses key for key-based authentication. To use Azure Machine Learning token-based authentication, use aml_token. To use Microsoft Entra token-based authentication (preview), use aad_token. For more information on authenticating, see Authenticate clients for online endpoints.
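For example, the following sketch configures the same endpoint with Azure Machine Learning token-based authentication; only the auth_mode value changes:

# same endpoint configuration, but with Azure Machine Learning token-based authentication
endpoint = ManagedOnlineEndpoint(
    name=endpoint_name,
    description="this is a sample endpoint",
    auth_mode="aml_token",
)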

Define the deployment

A deployment is a set of resources required for hosting the model that does the actual inferencing. For this example, you deploy a scikit-learn model that does regression and use a scoring script score.py to execute the model upon a given input request.

To learn about the key attributes of a deployment, see Online deployments.

Configure a deployment

Your deployment configuration uses the location of the model that you wish to deploy.

To configure a deployment:

model = Model(path="../model-1/model/sklearn_regression_model.pkl")
env = Environment(
    conda_file="../model-1/environment/conda.yaml",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
)

blue_deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name=endpoint_name,
    model=model,
    environment=env,
    code_configuration=CodeConfiguration(
        code="../model-1/onlinescoring", scoring_script="score.py"
    ),
    instance_type="Standard_DS3_v2",
    instance_count=1,
)
  • Model - specifies the model properties inline, using the path (where to upload files from). The SDK automatically uploads the model files and registers the model with an autogenerated name.
  • Environment - specifies the environment properties inline, including the location of the files to upload. The SDK automatically uploads the conda.yaml file and registers the environment. Later, to build the environment, the deployment uses the image (in this example, mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest) as the base image, and the conda_file dependencies are installed on top of it.
  • CodeConfiguration - during deployment, the local files, such as the Python source for the scoring model, are uploaded from the development environment.

For more information about online deployment definition, see OnlineDeployment Class.

Understand the scoring script

Tip

The format of the scoring script for online endpoints is the same format that's used in the preceding version of the CLI and in the Python SDK.

The scoring script must have an init() function and a run() function.

This example uses the score.py file:

import os
import logging
import json
import numpy
import joblib


def init():
    """
    This function is called when the container is initialized/started, typically after create/update of the deployment.
    You can write the logic here to perform init operations like caching the model in memory
    """
    global model
    # AZUREML_MODEL_DIR is an environment variable created during deployment.
    # It is the path to the model folder (./azureml-models/$MODEL_NAME/$VERSION)
    # Please provide your model's folder name if there is one
    model_path = os.path.join(
        os.getenv("AZUREML_MODEL_DIR"), "model/sklearn_regression_model.pkl"
    )
    # deserialize the model file back into a sklearn model
    model = joblib.load(model_path)
    logging.info("Init complete")


def run(raw_data):
    """
    This function is called for every invocation of the endpoint to perform the actual scoring/prediction.
    In the example we extract the data from the json input and call the scikit-learn model's predict()
    method and return the result back
    """
    logging.info("model 1: request received")
    data = json.loads(raw_data)["data"]
    data = numpy.array(data)
    result = model.predict(data)
    logging.info("Request processed")
    return result.tolist()

The init() function is called when the container is initialized or started. Initialization typically occurs shortly after the deployment is created or updated. The init function is the place to write logic for global initialization operations like caching the model in memory (as shown in this score.py file).

The run() function is called every time the endpoint is invoked, and it does the actual scoring and prediction. In this score.py file, the run() function extracts data from a JSON input, calls the scikit-learn model's predict() method, and then returns the prediction result.
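If you want to sanity-check the scoring functions outside of any server, you can call them directly in a Python session. The following sketch assumes the repository layout used in this article, that scikit-learn, numpy, and joblib are installed locally, and an illustrative payload shape; adjust these to match your model.

import json
import os
import sys

# Make the score.py functions importable; the path is relative to the
# azureml-examples/sdk/python/endpoints/online/managed directory.
sys.path.append("../model-1/onlinescoring")
from score import init, run

# Point AZUREML_MODEL_DIR at the local folder that contains model/sklearn_regression_model.pkl.
os.environ["AZUREML_MODEL_DIR"] = "../model-1"

init()  # loads the model into the module-level `model` variable

# Illustrative payload; the real input schema depends on your model's features.
payload = json.dumps({"data": [[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]]})
print(run(payload))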

Deploy and debug locally by using a local endpoint

We highly recommend that you test-run your endpoint locally to validate and debug your code and configuration before you deploy to Azure. The Azure CLI and Python SDK support local endpoints and deployments, while Azure Machine Learning studio and ARM templates don't.

To deploy locally, Docker Engine must be installed and running. Docker Engine typically starts when the computer starts. If it doesn't, you can troubleshoot Docker Engine.

Tip

You can use the Azure Machine Learning inference HTTP server Python package to debug your scoring script locally without Docker Engine. Debugging with the inference server helps you validate the scoring script before deploying to local endpoints, so that you can debug without being affected by the deployment container configuration.

For more information on debugging online endpoints locally before deploying to Azure, see Online endpoint debugging.

Deploy the model locally

First create an endpoint. Optionally, for a local endpoint, you can skip this step and directly create the deployment (next step), which will, in turn, create the required metadata. Deploying models locally is useful for development and testing purposes.

ml_client.online_endpoints.begin_create_or_update(endpoint, local=True)

Now, create a deployment named blue under the endpoint.

ml_client.online_deployments.begin_create_or_update(
    deployment=blue_deployment, local=True
)

The local=True flag directs the SDK to deploy the endpoint in the Docker environment.

Tip

Use Visual Studio Code to test and debug your endpoints locally. For more information, see debug online endpoints locally in Visual Studio Code.

Verify that the local deployment succeeded

Check the deployment status to see whether the model was deployed without error:

ml_client.online_endpoints.get(name=endpoint_name, local=True)

The method returns a ManagedOnlineEndpoint entity. If the deployment succeeded, provisioning_state is Succeeded.

ManagedOnlineEndpoint({'public_network_access': None, 'provisioning_state': 'Succeeded', 'scoring_uri': 'http://localhost:49158/score', 'swagger_uri': None, 'name': 'endpt-10061534497697', 'description': 'this is a sample endpoint', 'tags': {}, 'properties': {}, 'id': None, 'Resource__source_path': None, 'base_path': '/path/to/your/working/directory', 'creation_context': None, 'serialize': <msrest.serialization.Serializer object at 0x7ffb781bccd0>, 'auth_mode': 'key', 'location': 'local', 'identity': None, 'traffic': {}, 'mirror_traffic': {}, 'kind': None})

The following table contains the possible values for provisioning_state:

Value       Description
Creating    The resource is being created.
Updating    The resource is being updated.
Deleting    The resource is being deleted.
Succeeded   The create/update operation succeeded.
Failed      The create/update/delete operation failed.
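
For example, a small sketch that fails fast if the local endpoint isn't in the Succeeded state:

local_endpoint = ml_client.online_endpoints.get(name=endpoint_name, local=True)
if local_endpoint.provisioning_state != "Succeeded":
    raise RuntimeError(f"Local endpoint is in state: {local_endpoint.provisioning_state}")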

Invoke the local endpoint to score data by using your model

Invoke the endpoint to score the model by using the invoke method and passing a sample request that's stored in a JSON file.

ml_client.online_endpoints.invoke(
    endpoint_name=endpoint_name,
    request_file="../model-1/sample-request.json",
    local=True,
)

If you want to use a REST client (like curl), you must have the scoring URI. To get the scoring URI, run the following code. In the returned data, find the scoring_uri attribute.

endpoint = ml_client.online_endpoints.get(endpoint_name, local=True)
scoring_uri = endpoint.scoring_uri
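
As a sketch, the following code posts the sample request to the local scoring URI by using the requests library (this assumes requests is installed; for an Azure endpoint with key authentication, you'd also add an Authorization: Bearer <key> header):

import json

import requests

# Reuse the sample request that ships with the example.
with open("../model-1/sample-request.json") as f:
    sample_request = json.load(f)

response = requests.post(
    scoring_uri,
    json=sample_request,
    headers={"Content-Type": "application/json"},
)
print(response.status_code, response.json())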

Review the logs for output from the invoke operation

In the example score.py file, the run() method logs some output to the console.

You can view this output by using the get_logs method:

ml_client.online_deployments.get_logs(
    name="blue", endpoint_name=endpoint_name, local=True, lines=50
)

Deploy your online endpoint to Azure

Next, deploy your online endpoint to Azure. As a best practice for production, we recommend that you register the model and environment that you'll use in your deployment.

Register your model and environment

We recommend that you register your model and environment before deployment to Azure so that you can specify their registered names and versions during deployment. Registering your assets allows you to reuse them without the need to upload them every time you create deployments, thereby increasing reproducibility and traceability.

Note

Unlike deployment to Azure, local deployment doesn't support using registered models and environments. Rather, local deployment uses local model files and uses environments with local files only. For deployment to Azure, you can use either local or registered assets (models and environments). In this section of the article, the deployment to Azure uses registered assets, but you have the option of using local assets instead. For an example of a deployment configuration that uploads local files to use for local deployment, see Configure a deployment.

  1. Register a model

    from azure.ai.ml.entities import Model
    from azure.ai.ml.constants import AssetTypes
    
    file_model = Model(
        path="../../model-1/model/",
        type=AssetTypes.CUSTOM_MODEL,
        name="my-model",
        description="Model created from local file.",
    )
    ml_client.models.create_or_update(file_model)
    
  2. Register the environment:

    from azure.ai.ml.entities import Environment
    
    env_docker_conda = Environment(
        image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
        conda_file="../../model-1/environment/conda.yaml",
        name="my-env",
        description="Environment created from a Docker image plus Conda environment.",
    )
    ml_client.environments.create_or_update(env_docker_conda)
    

To learn how to register your model as an asset so that you can specify its registered name and version during deployment, see Register your model as an asset in Machine Learning by using the SDK.

For more information on creating an environment, see Manage Azure Machine Learning environments with the CLI & SDK (v2).

Important

When defining a custom environment for your deployment, ensure that the azureml-inference-server-http package is included in the conda file. This package is essential for the inference server to function properly. If you're unfamiliar with creating your own custom environment, use one of the curated environments instead, such as minimal-py-inference (for custom models that don't use MLflow) or mlflow-py-inference (for models that use MLflow). You can find these curated environments on the Environments tab of Azure Machine Learning studio.

Configure a deployment that uses registered assets

Your deployment configuration uses the registered model that you wish to deploy and your registered environment.

To configure a deployment, use the registered model and environment:

model = "azureml:my-model:1"
env = "azureml:my-env:1"
 
blue_deployment_with_registered_assets = ManagedOnlineDeployment(
    name="blue",
    endpoint_name=endpoint_name,
    model=model,
    environment=env,
    code_configuration=CodeConfiguration(
        code="../model-1/onlinescoring", scoring_script="score.py"
    ),
    instance_type="Standard_DS3_v2",
    instance_count=1,
)

Use different CPU and GPU instance types and images

You can specify the CPU or GPU instance types and images in your deployment configuration for both local deployment and deployment to Azure.

Earlier, you configured a deployment that used a general-purpose instance type, Standard_DS3_v2, and a non-GPU Docker image, mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest. For GPU compute, choose a GPU compute SKU and a GPU Docker image.

For supported general-purpose and GPU instance types, see Managed online endpoints supported VM SKUs. For a list of Azure Machine Learning CPU and GPU base images, see Azure Machine Learning base images.
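
For illustration, a GPU deployment configuration might look like the following sketch. The instance type and base image here are placeholders: confirm that the SKU is supported for managed online endpoints and that you have quota for it, and substitute a GPU base image from the Azure Machine Learning base images list.

gpu_deployment = ManagedOnlineDeployment(
    name="blue-gpu",
    endpoint_name=endpoint_name,
    model=model,
    environment=Environment(
        conda_file="../model-1/environment/conda.yaml",
        image="mcr.microsoft.com/azureml/<gpu-base-image>:latest",  # replace with a GPU base image
    ),
    code_configuration=CodeConfiguration(
        code="../model-1/onlinescoring", scoring_script="score.py"
    ),
    instance_type="Standard_NC6s_v3",  # example GPU SKU; verify availability and quota
    instance_count=1,
)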

Note

To use Kubernetes, instead of managed endpoints, as a compute target, see Introduction to Kubernetes compute target.

Next, deploy your online endpoint to Azure.

Deploy to Azure

  1. Create the endpoint:

    Using the endpoint you defined earlier and the MLClient you created earlier, you can now create the endpoint in the workspace. This command starts the endpoint creation and returns a confirmation response while the endpoint creation continues.

    ml_client.online_endpoints.begin_create_or_update(endpoint)
    
  2. Create the deployment:

    Using the blue_deployment_with_registered_assets that you defined earlier and the MLClient you created earlier, you can now create the deployment in the workspace. This command starts the deployment creation and returns a confirmation response while the deployment creation continues.

    ml_client.online_deployments.begin_create_or_update(blue_deployment_with_registered_assets)
    

    Tip

    If you prefer not to block your Python console, you can add the flag no_wait=True to the parameters. However, this option will stop the interactive display of the deployment status.

  3. Route traffic to the deployment:

    By default, the new deployment receives no traffic. Route 100% of the endpoint's traffic to the blue deployment and apply the change:

    # blue deployment takes 100% of traffic
    endpoint.traffic = {"blue": 100}
    ml_client.online_endpoints.begin_create_or_update(endpoint)
    

To debug errors in your deployment, see Troubleshooting online endpoint deployments.

Check the status of the endpoint

  1. Check the endpoint's status to see whether the model was deployed without error:

    ml_client.online_endpoints.get(name=endpoint_name)
    
  2. List all the endpoints in the workspace by using the list method:

    for endpoint in ml_client.online_endpoints.list():
        print(endpoint.name)
    

    The method returns a list (iterator) of ManagedOnlineEndpoint entities.

  3. You can get more information by specifying more parameters. For example, output the list of endpoints like a table:

    print("Kind\tLocation\tName")
    print("-------\t----------\t------------------------")
    for endpoint in ml_client.online_endpoints.list():
        print(f"{endpoint.kind}\t{endpoint.location}\t{endpoint.name}")
    

Check the status of the online deployment

Check the logs to see whether the model was deployed without error.

  1. You can view log output by using the get_logs method:

    ml_client.online_deployments.get_logs(
        name="blue", endpoint_name=endpoint_name, lines=50
    )
    
  2. By default, logs are pulled from the inference server container. To see logs from the storage initializer container, add the container_type="storage-initializer" option. For more information on deployment logs, see Get container logs.

    ml_client.online_deployments.get_logs(
        name="blue", endpoint_name=endpoint_name, lines=50, container_type="storage-initializer"
    )
    

Invoke the endpoint to score data by using your model

Using the MLClient created earlier, get a handle to the endpoint. You can then invoke the endpoint by using the invoke method with the following parameters:

  • endpoint_name - Name of the endpoint
  • request_file - File with request data
  • deployment_name - Name of the specific deployment to test in an endpoint
Send a sample request using a JSON file:

    # test the blue deployment with some sample data
    ml_client.online_endpoints.invoke(
        endpoint_name=endpoint_name,
        deployment_name="blue",
        request_file="../model-1/sample-request.json",
    )
    

(Optional) Update the deployment

If you want to update the code, model, or environment, update the configuration, and then run the MLClient's online_deployments.begin_create_or_update method to create or update a deployment.

Note

If you update instance count (to scale your deployment) along with other model settings (such as code, model, or environment) in a single begin_create_or_update method, the scaling operation will be performed first, then the other updates will be applied. It's a good practice to perform these operations separately in a production environment.
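
For example, to scale the deployment on its own, change only instance_count and apply the update separately. This sketch assumes the blue_deployment_with_registered_assets object defined earlier:

# Scale out without changing code, model, or environment.
blue_deployment_with_registered_assets.instance_count = 2
ml_client.online_deployments.begin_create_or_update(blue_deployment_with_registered_assets)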

To understand how begin_create_or_update works:

  1. Open the file online/model-1/onlinescoring/score.py.

  2. Change the last line of the init() function: After logging.info("Init complete"), add logging.info("Updated successfully").

  3. Save the file.

  4. Run the method:

    ml_client.online_deployments.begin_create_or_update(blue_deployment_with_registered_assets)
    
  5. Because you modified the init() function, which runs when the endpoint is created or updated, the message Updated successfully will be in the logs. Retrieve the logs by running:

    ml_client.online_deployments.get_logs(
        name="blue", endpoint_name=endpoint_name, lines=50
    )
    

The begin_create_or_update method also works with local deployments. Use the same method with the local=True flag.

Note

The update to the deployment in this section is an example of an in-place rolling update.

  • For a managed online endpoint, the deployment is updated to the new configuration with 20% of the nodes at a time. That is, if the deployment has 10 nodes, 2 nodes at a time are updated.
  • For a Kubernetes online endpoint, the system iteratively creates a new deployment instance with the new configuration and deletes the old one.
  • For production usage, you should consider blue-green deployment, which offers a safer alternative for updating a web service.
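
For illustration, a blue-green rollout might look like the following sketch: create a second deployment named green with the new configuration, shift a small share of traffic to it, and retire blue only after green proves healthy. The configuration shown simply reuses the registered assets from earlier; in practice, green would carry your updated code, model, or environment.

# create a "green" deployment alongside "blue"
green_deployment = ManagedOnlineDeployment(
    name="green",
    endpoint_name=endpoint_name,
    model=model,
    environment=env,
    code_configuration=CodeConfiguration(
        code="../model-1/onlinescoring", scoring_script="score.py"
    ),
    instance_type="Standard_DS3_v2",
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(green_deployment)

# send 10% of traffic to green while blue keeps 90%
endpoint.traffic = {"blue": 90, "green": 10}
ml_client.online_endpoints.begin_create_or_update(endpoint)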

(Optional) Configure autoscaling

Autoscale automatically runs the right amount of resources to handle the load on your application. Managed online endpoints support autoscaling through integration with the Azure Monitor autoscale feature. To configure autoscaling, see How to autoscale online endpoints.

(Optional) Monitor SLA by using Azure Monitor

To view metrics and set alerts based on your SLA, complete the steps that are described in Monitor online endpoints.

(Optional) Integrate with Log Analytics

The get-logs command for CLI or the get_logs method for SDK provides only the last few hundred lines of logs from an automatically selected instance. However, Log Analytics provides a way to durably store and analyze logs. For more information on using logging, see Monitor online endpoints.

Delete the endpoint and the deployment

Delete the endpoint and all its underlying deployments:

ml_client.online_endpoints.begin_delete(name=endpoint_name)