Online endpoints and deployments for real-time inference

APPLIES TO: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)

Azure Machine Learning allows you to perform real-time inferencing on data by using models that are deployed to online endpoints. Inferencing is the process of applying new input data to a machine learning model to generate outputs. While these outputs are typically referred to as "predictions," inferencing can be used to generate outputs for other machine learning tasks, such as classification and clustering.

Online endpoints

Online endpoints deploy models to a web server that can return predictions over the HTTP protocol. Use online endpoints to operationalize models for real-time inference in synchronous, low-latency requests. We recommend using them when:

  • You have low-latency requirements
  • Your model can answer the request in a relatively short amount of time
  • Your model's inputs fit in the HTTP payload of the request
  • You need to scale out to handle a large number of requests

To define an endpoint, you need to specify:

  • Endpoint name: This name must be unique in the Azure region. For more information on the naming rules, see endpoint limits.
  • Authentication mode: You can choose between key-based authentication mode and Azure Machine Learning token-based authentication mode for the endpoint. A key doesn't expire, but a token does. For more information on authenticating, see Authenticate to an online endpoint.
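As an illustration, here's a minimal sketch of defining an endpoint with these two properties by using the Python SDK (azure-ai-ml v2); the subscription, workspace, and endpoint names are placeholders:

```python
# Minimal sketch (Python SDK azure-ai-ml v2): define a managed online endpoint.
# The subscription, workspace, and endpoint names below are placeholders.
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineEndpoint
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group_name="<RESOURCE_GROUP>",
    workspace_name="<WORKSPACE_NAME>",
)

endpoint = ManagedOnlineEndpoint(
    name="my-endpoint",   # must be unique within the Azure region
    auth_mode="key",      # or "aml_token" for Azure Machine Learning token-based auth
    description="Sample real-time endpoint",
)
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
```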

Azure Machine Learning provides the convenience of using managed online endpoints for deploying your ML models in a turnkey manner. This is the recommended way to use online endpoints in Azure Machine Learning. Managed online endpoints work with powerful CPU and GPU machines in Azure in a scalable, fully managed way. These endpoints also take care of serving, scaling, securing, and monitoring your models, to free you from the overhead of setting up and managing the underlying infrastructure. To learn how to deploy to a managed online endpoint, see Deploy an ML model with an online endpoint.

Why choose managed online endpoints over ACI or AKS (v1)?

Use of managed online endpoints is the recommended way to use online endpoints in Azure Machine Learning. The following table highlights the key attributes of managed online endpoints compared to Azure Machine Learning SDK/CLI v1 solutions (ACI and AKS v1).

| Attributes | Managed online endpoints (v2) | ACI or AKS (v1) |
| --- | --- | --- |
| Network security/isolation | Easy inbound/outbound control with quick toggle | Virtual network not supported or requires complex manual configuration |
| Managed service | - Fully managed compute provisioning/scaling<br>- Network configuration for data exfiltration prevention<br>- Host OS upgrade, controlled rollout of in-place updates | - Scaling is limited in v1<br>- Network configuration or upgrade needs to be managed by the user |
| Endpoint/deployment concept | Distinction between endpoint and deployment enables complex scenarios such as safe rollout of models | No concept of endpoint |
| Diagnostics and monitoring | - Local endpoint debugging possible with Docker and Visual Studio Code<br>- Advanced metrics and logs analysis with chart/query to compare between deployments<br>- Cost breakdown down to deployment level | No easy local debugging |
| Scalability | Limitless, elastic, and automatic scaling | - ACI is non-scalable<br>- AKS (v1) supports in-cluster scale only and requires scalability configuration |
| Enterprise readiness | Private link, customer-managed keys, Microsoft Entra ID, quota management, billing integration, SLA | Not supported |
| Advanced ML features | - Model data collection<br>- Model monitoring<br>- Champion-challenger model, safe rollout, traffic mirroring<br>- Responsible AI extensibility | Not supported |

Alternatively, if you prefer to use Kubernetes to deploy your models and serve endpoints, and you're comfortable with managing infrastructure requirements, you can use Kubernetes online endpoints. These endpoints allow you to deploy models and serve online endpoints on your fully configured and managed Kubernetes cluster anywhere, with CPUs or GPUs.

Why choose managed online endpoints over AKS (v2)?

Managed online endpoints can help streamline your deployment process and provide the following benefits over Kubernetes online endpoints:

  • Managed infrastructure

    • Automatically provisions the compute and hosts the model (you just need to specify the VM type and scale settings)
    • Automatically updates and patches the underlying host OS image
    • Automatically performs node recovery if there's a system failure
  • Monitoring and logs

    Screenshot showing Azure Monitor graph of endpoint latency.

  • View costs

    Screenshot of a cost chart of an endpoint and deployment.

    Note

    Managed online endpoints are based on Azure Machine Learning compute. When using a managed online endpoint, you pay for the compute and networking charges. There is no additional surcharge. For more information on pricing, see the Azure pricing calculator.

    If you use an Azure Machine Learning virtual network to secure outbound traffic from the managed online endpoint, you're charged for the Azure private link and FQDN outbound rules that are used by the managed virtual network. For more information, see Pricing for managed virtual network.

Managed online endpoints vs. Kubernetes online endpoints

The following table highlights the key differences between managed online endpoints and Kubernetes online endpoints.

| | Managed online endpoints | Kubernetes online endpoints (AKS v2) |
| --- | --- | --- |
| Recommended users | Users who want a managed model deployment and enhanced MLOps experience | Users who prefer Kubernetes and can self-manage infrastructure requirements |
| Node provisioning | Managed compute provisioning, update, removal | User responsibility |
| Node maintenance | Managed host OS image updates, and security hardening | User responsibility |
| Cluster sizing (scaling) | Managed manual and autoscale, supporting additional nodes provisioning | Manual and autoscale, supporting scaling the number of replicas within fixed cluster boundaries |
| Compute type | Managed by the service | Customer-managed Kubernetes cluster |
| Managed identity | Supported | Supported |
| Virtual network (VNet) | Supported via managed network isolation | User responsibility |
| Out-of-box monitoring & logging | Azure Monitor and Log Analytics powered (includes key metrics and log tables for endpoints and deployments) | User responsibility |
| Logging with Application Insights (legacy) | Supported | Supported |
| View costs | Detailed to endpoint/deployment level | Cluster level |
| Cost applied to | VMs assigned to the deployments | VMs assigned to the cluster |
| Mirrored traffic | Supported | Unsupported |
| No-code deployment | Supported (MLflow and Triton models) | Supported (MLflow and Triton models) |

Online deployments

A deployment is a set of resources and compute required for hosting the model that does the actual inferencing. A single endpoint can contain multiple deployments with different configurations. This setup helps to decouple the interface presented by the endpoint from the implementation details present in the deployment. An online endpoint has a routing mechanism that can direct requests to specific deployments in the endpoint.

The following diagram shows an online endpoint that has two deployments, blue and green. The blue deployment uses VMs with a CPU SKU, and runs version 1 of a model. The green deployment uses VMs with a GPU SKU, and runs version 2 of the model. The endpoint is configured to route 90% of incoming traffic to the blue deployment, while the green deployment receives the remaining 10%.

Diagram showing an endpoint splitting traffic to two deployments.

The following table describes the key attributes of a deployment:

| Attribute | Description |
| --- | --- |
| Name | The name of the deployment. |
| Endpoint name | The name of the endpoint to create the deployment under. |
| Model | The model to use for the deployment. This value can be either a reference to an existing versioned model in the workspace or an inline model specification. |
| Code path | The path to the directory on the local development environment that contains all the Python source code for scoring the model. You can use nested directories and packages. |
| Scoring script | The relative path to the scoring file in the source code directory. This Python code must have an init() function and a run() function. The init() function is called after the model is created or updated (you can use it to cache the model in memory, for example). The run() function is called at every invocation of the endpoint to do the actual scoring and prediction. |
| Environment | The environment to host the model and code. This value can be either a reference to an existing versioned environment in the workspace or an inline environment specification. Note: Microsoft regularly patches the base images for known security vulnerabilities. You'll need to redeploy your endpoint to use the patched image. If you provide your own image, you're responsible for updating it. For more information, see Image patching. |
| Instance type | The VM size to use for the deployment. For the list of supported sizes, see Managed online endpoints SKU list. |
| Instance count | The number of instances to use for the deployment. Base the value on the workload you expect. For high availability, we recommend that you set the value to at least 3. We reserve an extra 20% for performing upgrades. For more information, see virtual machine quota allocation for deployments. |
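For reference, here's a minimal sketch of a scoring script with the required init() and run() functions; the model file name, loading logic, and input format are assumptions for illustration:

```python
# score.py: minimal scoring script sketch with the required init() and run() functions.
# The model file name (model.pkl) and the request payload shape are assumptions.
import json
import os

import joblib


def init():
    # Called once when the deployment starts; cache the model in memory.
    # AZUREML_MODEL_DIR points to the folder where the registered model is mounted.
    global model
    model_path = os.path.join(os.environ["AZUREML_MODEL_DIR"], "model.pkl")
    model = joblib.load(model_path)


def run(raw_data):
    # Called on every invocation of the endpoint; return JSON-serializable output.
    data = json.loads(raw_data)["data"]
    predictions = model.predict(data)
    return predictions.tolist()
```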

To learn how to deploy online endpoints using the CLI, SDK, studio, and ARM template, see Deploy an ML model with an online endpoint.
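The following is a minimal sketch of a deployment definition that ties these attributes together with the Python SDK (azure-ai-ml v2), reusing ml_client from the earlier endpoint sketch; the paths, names, and environment image are placeholders:

```python
# Minimal sketch (Python SDK azure-ai-ml v2): a deployment that maps to the attributes
# in the table above. Paths, names, and the base image are placeholders.
from azure.ai.ml.entities import (
    CodeConfiguration,
    Environment,
    ManagedOnlineDeployment,
    Model,
)

blue_deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="my-endpoint",
    model=Model(path="./model"),  # or a registered model reference such as "my-model:1"
    code_configuration=CodeConfiguration(code="./src", scoring_script="score.py"),
    environment=Environment(
        conda_file="./environment/conda.yaml",
        image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
    ),
    instance_type="Standard_DS3_v2",
    instance_count=3,  # at least 3 instances recommended for high availability
)
ml_client.online_deployments.begin_create_or_update(blue_deployment).result()
```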

Deployment for coders and non-coders

Azure Machine Learning supports model deployment to online endpoints for coders and non-coders alike, by providing options for no-code deployment, low-code deployment, and Bring Your Own Container (BYOC) deployment.

  • No-code deployment provides out-of-box inferencing for common frameworks (for example, scikit-learn, TensorFlow, PyTorch, and ONNX) via MLflow and Triton.
  • Low-code deployment allows you to provide minimal code along with your ML model for deployment.
  • BYOC deployment lets you bring virtually any container to run your online endpoint. You can use all the Azure Machine Learning platform features such as autoscaling, GitOps, debugging, and safe rollout to manage your MLOps pipelines.

The following table highlights key aspects about the online deployment options:

| | No-code | Low-code | BYOC |
| --- | --- | --- | --- |
| Summary | Uses out-of-box inferencing for popular frameworks such as scikit-learn, TensorFlow, PyTorch, and ONNX, via MLflow and Triton. For more information, see Deploy MLflow models to online endpoints. | Uses secure, publicly published curated images for popular frameworks, with updates every two weeks to address vulnerabilities. You provide the scoring script and/or Python dependencies. For more information, see Azure Machine Learning curated environments. | You provide your complete stack via Azure Machine Learning's support for custom images. For more information, see Use a custom container to deploy a model to an online endpoint. |
| Custom base image | No, a curated environment provides this for easy deployment. | Yes and no, you can use either a curated image or your customized image. | Yes, bring an accessible container image location (for example, docker.io, Azure Container Registry (ACR), or Microsoft Container Registry (MCR)) or a Dockerfile that you can build/push with ACR for your container. |
| Custom dependencies | No, a curated environment provides this for easy deployment. | Yes, bring the Azure Machine Learning environment in which the model runs: either a Docker image with Conda dependencies, or a Dockerfile. | Yes, included in the container image. |
| Custom code | No, the scoring script is autogenerated for easy deployment. | Yes, bring your scoring script. | Yes, included in the container image. |

Note

AutoML runs create a scoring script and dependencies automatically for users, so you can deploy any AutoML model without authoring additional code (for no-code deployment), or you can modify the autogenerated scripts to fit your business needs (for low-code deployment). To learn how to deploy AutoML models, see Deploy an AutoML model with an online endpoint.

Online endpoint debugging

Azure Machine Learning provides various ways to debug online endpoints locally and by using container logs.

Local debugging with the Azure Machine Learning inference HTTP server

You can debug your scoring script locally by using the Azure Machine Learning inference HTTP server. The HTTP server is a Python package that exposes your scoring function as an HTTP endpoint and wraps the Flask server code and dependencies into a singular package. It's included in the prebuilt Docker images for inference that are used when deploying a model with Azure Machine Learning. Using the package alone, you can deploy the model locally for production, and you can also easily validate your scoring (entry) script in a local development environment. If there's a problem with the scoring script, the server will return an error and the location where the error occurred. You can also use Visual Studio Code to debug with the Azure Machine Learning inference HTTP server.

To learn more about debugging with the HTTP server, see Debugging scoring script with Azure Machine Learning inference HTTP server.
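As a rough sketch, after you install the azureml-inference-server-http package and start the server against your scoring script, you can exercise the scoring route locally; the default port and the request payload below are assumptions:

```python
# Sketch: smoke-test a scoring script served locally by the Azure Machine Learning
# inference HTTP server. Assumes the server was started in another terminal, e.g.:
#   azmlinfsrv --entry_script score.py
# The default port (5001) and the request payload shape are assumptions.
import requests

scoring_uri = "http://127.0.0.1:5001/score"
sample = {"data": [[1.0, 2.0, 3.0, 4.0]]}

response = requests.post(scoring_uri, json=sample)
print(response.status_code)
print(response.json())
```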

Local debugging

For local debugging, you need a local deployment; that is, a model that is deployed to a local Docker environment. You can use this local deployment for testing and debugging before deployment to the cloud. To deploy locally, you'll need to have the Docker Engine installed and running. Azure Machine Learning then creates a local Docker image that mimics the Azure Machine Learning image. Azure Machine Learning will build and run deployments for you locally and cache the image for rapid iterations.

The steps for local debugging typically include:

  • Checking that the local deployment succeeded
  • Invoking the local endpoint for inferencing
  • Reviewing the logs for output of the invoke operation

To learn more about local debugging, see Deploy and debug locally by using local endpoints.
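A minimal sketch of these steps with the Python SDK (azure-ai-ml v2), assuming Docker Engine is running and reusing the endpoint and deployment objects from the earlier sketches; the request file is a placeholder:

```python
# Sketch (Python SDK azure-ai-ml v2): create, check, invoke, and inspect a local
# deployment by passing local=True.
ml_client.online_endpoints.begin_create_or_update(endpoint, local=True)
ml_client.online_deployments.begin_create_or_update(blue_deployment, local=True)

# Check that the local deployment succeeded
print(ml_client.online_endpoints.get(name="my-endpoint", local=True).provisioning_state)

# Invoke the local endpoint for inferencing
print(
    ml_client.online_endpoints.invoke(
        endpoint_name="my-endpoint",
        request_file="./sample-request.json",
        local=True,
    )
)

# Review the logs for output of the invoke operation
print(
    ml_client.online_deployments.get_logs(
        name="blue", endpoint_name="my-endpoint", local=True, lines=50
    )
)
```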

Local debugging with Visual Studio Code (preview)

Important

This feature is currently in public preview. This preview version is provided without a service-level agreement, and we don't recommend it for production workloads. Certain features might not be supported or might have constrained capabilities.

For more information, see Supplemental Terms of Use for Microsoft Azure Previews.

As with local debugging, you first need to have the Docker Engine installed and running and then deploy a model to the local Docker environment. Once you have a local deployment, Azure Machine Learning local endpoints use Docker and Visual Studio Code development containers (dev containers) to build and configure a local debugging environment. With dev containers, you can take advantage of Visual Studio Code features, such as interactive debugging, from inside a Docker container.

To learn more about interactively debugging online endpoints in VS Code, see Debug online endpoints locally in Visual Studio Code.

Debugging with container logs

For a deployment, you can't get direct access to the VM where the model is deployed. However, you can get logs from some of the containers that are running on the VM. There are two types of containers that you can get the logs from:

  • Inference server: Logs include the console log (from the inference server), which contains the output of print/logging functions from your scoring script (score.py).
  • Storage initializer: Logs contain information on whether code and model data were successfully downloaded to the container. The container runs before the inference server container starts to run.

To learn more about debugging with container logs, see Get container logs.
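A minimal sketch of retrieving both log types with the Python SDK (azure-ai-ml v2), reusing ml_client from the earlier examples; the endpoint and deployment names are placeholders:

```python
# Sketch (Python SDK azure-ai-ml v2): pull logs from the two container types of a
# cloud deployment.
inference_server_logs = ml_client.online_deployments.get_logs(
    name="blue", endpoint_name="my-endpoint", lines=100
)

storage_initializer_logs = ml_client.online_deployments.get_logs(
    name="blue",
    endpoint_name="my-endpoint",
    lines=100,
    container_type="storage-initializer",
)

print(inference_server_logs)
print(storage_initializer_logs)
```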

Traffic routing and mirroring to online deployments

Recall that a single online endpoint can have multiple deployments. As the endpoint receives incoming traffic (or requests), it can route percentages of traffic to each deployment, as used in the native blue/green deployment strategy. It can also mirror (or copy) traffic from one deployment to another, also called traffic mirroring or shadowing.

Traffic routing for blue/green deployment

Blue/green deployment is a deployment strategy that allows you to roll out a new deployment (the green deployment) to a small subset of users or requests before rolling it out completely. The endpoint can implement load balancing to allocate certain percentages of the traffic to each deployment, with the total allocation across all deployments adding up to 100%.

Tip

A request can bypass the configured traffic load balancing by including an HTTP header of azureml-model-deployment. Set the header value to the name of the deployment you want the request to route to.

The following image shows settings in Azure Machine Learning studio for allocating traffic between a blue and green deployment.

Screenshot showing slider interface to set traffic allocation between deployments.

This traffic allocation routes traffic as shown in the following image, with 10% of traffic going to the green deployment, and 90% of traffic going to the blue deployment.

Diagram showing an endpoint splitting traffic to two deployments.
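As an illustration, here's a minimal sketch of setting this traffic split, and of the header-based bypass from the tip above, with the Python SDK (azure-ai-ml v2); the endpoint name, key, and payload are placeholders:

```python
# Sketch (Python SDK azure-ai-ml v2): apply the 90/10 traffic split shown above, then
# bypass the split for a single request with the azureml-model-deployment header.
# Reuses ml_client from the earlier sketches; key and payload are placeholders.
import requests

endpoint = ml_client.online_endpoints.get(name="my-endpoint")
endpoint.traffic = {"blue": 90, "green": 10}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# Route one request directly to the green deployment, ignoring the traffic split.
response = requests.post(
    endpoint.scoring_uri,
    json={"data": [[1.0, 2.0, 3.0, 4.0]]},
    headers={
        "Authorization": "Bearer <ENDPOINT_KEY_OR_TOKEN>",
        "azureml-model-deployment": "green",
    },
)
print(response.json())
```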

Traffic mirroring to online deployments

The endpoint can also mirror (or copy) traffic from one deployment to another deployment. Traffic mirroring (also called shadow testing) is useful when you want to test a new deployment with production traffic without impacting the results that customers are receiving from existing deployments. For example, when implementing a blue/green deployment where 100% of the traffic is routed to blue and 10% is mirrored to the green deployment, the results of the mirrored traffic to the green deployment aren't returned to the clients, but the metrics and logs are recorded.

Diagram showing an endpoint mirroring traffic to a deployment.

To learn how to use traffic mirroring, see Safe rollout for online endpoints.
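A minimal sketch of this mirroring configuration with the Python SDK (azure-ai-ml v2), assuming the blue and green deployments from the earlier examples:

```python
# Sketch (Python SDK azure-ai-ml v2): mirror 10% of incoming traffic to the green
# deployment while blue continues to serve 100% of the responses returned to clients.
endpoint = ml_client.online_endpoints.get(name="my-endpoint")
endpoint.traffic = {"blue": 100, "green": 0}
endpoint.mirror_traffic = {"green": 10}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
```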

More capabilities of online endpoints in Azure Machine Learning

Authentication and encryption

  • Authentication: key and Azure Machine Learning token
  • Managed identity: user-assigned and system-assigned
  • SSL by default for endpoint invocation
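As a rough sketch with the Python SDK (azure-ai-ml v2), assuming key-based authentication and reusing ml_client from the earlier examples, retrieving a key and invoking the endpoint might look like this; the payload is a placeholder:

```python
# Sketch (Python SDK azure-ai-ml v2): retrieve the endpoint key and use it to call the
# scoring URI. Assumes the endpoint uses key-based authentication.
import requests

keys = ml_client.online_endpoints.get_keys(name="my-endpoint")
endpoint = ml_client.online_endpoints.get(name="my-endpoint")

response = requests.post(
    endpoint.scoring_uri,
    json={"data": [[1.0, 2.0, 3.0, 4.0]]},
    headers={"Authorization": f"Bearer {keys.primary_key}"},
)
print(response.json())
```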

Autoscaling

Autoscale automatically runs the right amount of resources to handle the load on your application. Managed endpoints support autoscaling through integration with the Azure Monitor autoscale feature. You can configure metrics-based scaling (for instance, CPU utilization >70%), schedule-based scaling (for example, scaling rules for peak business hours), or a combination of the two.

Screenshot showing that autoscale flexibly provides between min and max instances, depending on rules.

To learn how to configure autoscaling, see How to autoscale online endpoints.
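If you configure autoscaling programmatically, a hedged sketch using the Azure Monitor autoscale API (the azure-mgmt-monitor package) might look like the following; the resource IDs, names, threshold, and the metric name are assumptions to adapt to your own deployment:

```python
# Sketch: a CPU-based scale-out rule for a managed online deployment via the Azure
# Monitor autoscale API. Resource IDs, names, and the metric name are assumptions.
from azure.identity import DefaultAzureCredential
from azure.mgmt.monitor import MonitorManagementClient
from azure.mgmt.monitor.models import (
    AutoscaleProfile, AutoscaleSettingResource, MetricTrigger, ScaleAction,
    ScaleCapacity, ScaleRule,
)

monitor_client = MonitorManagementClient(DefaultAzureCredential(), "<SUBSCRIPTION_ID>")
deployment_resource_id = (
    "/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/"
    "Microsoft.MachineLearningServices/workspaces/<WORKSPACE>/onlineEndpoints/"
    "my-endpoint/deployments/blue"
)

rule = ScaleRule(
    metric_trigger=MetricTrigger(
        metric_name="CpuUtilizationPercentage",  # assumed deployment metric name
        metric_resource_uri=deployment_resource_id,
        time_grain="PT1M", statistic="Average", time_window="PT5M",
        time_aggregation="Average", operator="GreaterThan", threshold=70,
    ),
    scale_action=ScaleAction(
        direction="Increase", type="ChangeCount", value="1", cooldown="PT5M"
    ),
)

monitor_client.autoscale_settings.create_or_update(
    resource_group_name="<RESOURCE_GROUP>",
    autoscale_setting_name="my-endpoint-autoscale",
    parameters=AutoscaleSettingResource(
        location="<REGION>",
        target_resource_uri=deployment_resource_id,
        enabled=True,
        profiles=[AutoscaleProfile(
            name="cpu-scale-out",
            capacity=ScaleCapacity(minimum="2", maximum="5", default="2"),
            rules=[rule],
        )],
    ),
)
```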

Managed network isolation

When deploying an ML model to a managed online endpoint, you can secure communication with the online endpoint by using private endpoints.

You can configure security for inbound scoring requests and outbound communications with the workspace and other services separately. Inbound communications use the private endpoint of the Azure Machine Learning workspace. Outbound communications use private endpoints created for the workspace's managed virtual network.

For more information, see Network isolation with managed online endpoints.
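As a small sketch of the inbound side with the Python SDK (azure-ai-ml v2), you can disable public network access on the endpoint; the outbound side is configured on the workspace's managed virtual network and isn't shown here:

```python
# Sketch (Python SDK azure-ai-ml v2): a managed online endpoint that rejects public
# inbound scoring traffic, so requests must come through the workspace's private endpoint.
from azure.ai.ml.entities import ManagedOnlineEndpoint

private_endpoint = ManagedOnlineEndpoint(
    name="my-private-endpoint",
    auth_mode="key",
    public_network_access="disabled",  # inbound scoring only via the private endpoint
)
ml_client.online_endpoints.begin_create_or_update(private_endpoint).result()
```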

Monitoring online endpoints and deployments

Monitoring for Azure Machine Learning endpoints is possible via integration with Azure Monitor. This integration allows you to view metrics in charts, configure alerts, query from log tables, use Application Insights to analyze events from user containers, and so on.

  • Metrics: Use Azure Monitor to track various endpoint metrics, such as request latency, and drill down to deployment or status level. You can also track deployment-level metrics, such as CPU/GPU utilization and drill down to instance level. Azure Monitor allows you to track these metrics in charts and set up dashboards and alerts for further analysis.

  • Logs: Send metrics to a Log Analytics workspace, where you can query logs using the Kusto query language. You can also send metrics to a storage account and/or Event Hubs for further processing. In addition, you can use dedicated log tables for online endpoint related events, traffic, and container logs. Kusto queries allow complex analysis that joins multiple tables.

  • Application Insights: Curated environments include the integration with Application Insights, and you can enable or disable it when you create an online deployment. Built-in metrics and logs are sent to Application Insights, and you can use its built-in features such as Live metrics, Transaction search, Failures, and Performance for further analysis.

For more information on monitoring, see Monitor online endpoints.
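For example, here's a hedged sketch of querying endpoint traffic logs from a Log Analytics workspace with the azure-monitor-query package; the workspace ID and the table name are assumptions to verify against your own workspace:

```python
# Sketch: query online endpoint traffic logs from a Log Analytics workspace.
# The workspace ID and the table name (AmlOnlineEndpointTrafficLog) are assumptions;
# check the tables available in your Log Analytics workspace.
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

logs_client = LogsQueryClient(DefaultAzureCredential())
query = """
AmlOnlineEndpointTrafficLog
| where TimeGenerated > ago(1h)
| summarize requests = count() by ResponseCode
"""

result = logs_client.query_workspace(
    workspace_id="<LOG_ANALYTICS_WORKSPACE_ID>",
    query=query,
    timespan=timedelta(hours=1),
)
for table in result.tables:
    for row in table.rows:
        print(row)
```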

Secret injection in online deployments (preview)

Secret injection in the context of an online deployment is a process of retrieving secrets (such as API keys) from secret stores, and injecting them into your user container that runs inside an online deployment. Secrets will eventually be accessible via environment variables, thereby providing a secure way for them to be consumed by the inference server that runs your scoring script or by the inferencing stack that you bring with a BYOC (bring your own container) deployment approach.

There are two ways to inject secrets. You can inject secrets yourself, using managed identities, or you can use the secret injection feature. To learn more about the ways to inject secrets, see Secret injection in online endpoints (preview).
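Inside the deployment, an injected secret is read like any other environment variable. Here's a minimal sketch of a scoring-script fragment, assuming a hypothetical variable name MY_API_KEY that depends on how secret injection is configured for the deployment:

```python
# Sketch: consuming an injected secret from inside the scoring script.
# The environment variable name (MY_API_KEY) is a placeholder.
import os


def init():
    global api_key
    api_key = os.environ["MY_API_KEY"]  # injected secret surfaced as an env var
```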

Next steps