Online endpoints

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

After you train a machine learning model, you need to deploy it so that others can consume its predictions. Running a model to produce predictions is called inference. Azure Machine Learning uses the concepts of endpoints and deployments for machine learning model inference.

Online endpoints are endpoints used for online (real-time) inferencing. They deploy models behind a web server that returns predictions over HTTP.

The following diagram shows an online endpoint that has two deployments, 'blue' and 'green'. The blue deployment uses VMs with a CPU SKU, and runs version 1 of a model. The green deployment uses VMs with a GPU SKU, and uses version 2 of the model. The endpoint is configured to route 90% of incoming traffic to the blue deployment, while green receives the remaining 10%.

Diagram showing an endpoint splitting traffic to two deployments.

Online deployment requirements

To create an online endpoint, you need to specify the following elements:

  • Model to deploy
  • Scoring script - code needed to do scoring/inferencing
  • Environment - a Docker image with Conda dependencies, or a Dockerfile
  • Compute instance & scale settings

Learn how to deploy online endpoints from the CLI/SDK and the studio web portal.

Test and deploy locally for faster debugging

Deploy locally to test your endpoints without deploying to the cloud. Azure Machine Learning creates a local Docker image that mimics the Azure Machine Learning image. Azure Machine Learning will build and run deployments for you locally, and cache the image for rapid iterations.

Native blue/green deployment

Recall that a single endpoint can have multiple deployments. The online endpoint can load-balance requests, giving any percentage of traffic to each deployment.

Traffic allocation can be used for safe rollout with blue/green deployments, by balancing requests between the deployments.
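To make percentage-based traffic allocation concrete, here is a minimal simulation of weighted routing. It is only a sketch of the idea, not the service's actual load balancer; the function name `choose_deployment` and the requirement that weights sum to 100 are assumptions made for this example.

```python
import random


def choose_deployment(traffic, roll):
    """Pick a deployment name for a roll in [0, 100) given percentage weights."""
    if sum(traffic.values()) != 100:
        raise ValueError("traffic percentages must sum to 100 in this sketch")
    upper = 0
    for name, percent in traffic.items():
        upper += percent
        if roll < upper:
            return name
    raise ValueError("roll must be in [0, 100)")


# The 90/10 split from the blue/green example.
traffic = {"blue": 90, "green": 10}

# Simulate 10,000 requests with a seeded generator so runs are repeatable.
rng = random.Random(0)
counts = {"blue": 0, "green": 0}
for _ in range(10_000):
    counts[choose_deployment(traffic, rng.random() * 100)] += 1

print(counts)  # roughly 90% blue and 10% green
```

Shifting the rollout forward is then a matter of changing the weights, for example to `{"blue": 50, "green": 50}` and eventually `{"blue": 0, "green": 100}`.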

Tip

A request can bypass the configured traffic load balancing by including an HTTP header of azureml-model-deployment. Set the header value to the name of the deployment you want the request to route to.
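The following sketch shows how a client might attach that header when building a scoring request with Python's standard library. The endpoint URL, the credential placeholder, the payload shape, and the helper name `build_scoring_request` are all hypothetical; the bearer-style `Authorization` header is an assumption about how the key or token is passed. The request is constructed but not sent.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, credential, payload, deployment=None):
    """Build a POST request; optionally pin it to one named deployment."""
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {credential}",  # assumed auth scheme
    }
    if deployment is not None:
        # Bypasses the endpoint's traffic split for this one request.
        headers["azureml-model-deployment"] = deployment
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(scoring_uri, data=body, headers=headers, method="POST")


request = build_scoring_request(
    "https://my-endpoint.example.com/score",  # hypothetical scoring URI
    "<endpoint-key-or-token>",                # placeholder credential
    {"data": [[1, 2, 3]]},                    # hypothetical payload
    deployment="green",
)
# urllib.request.urlopen(request) would send it; omitted because the URL is a placeholder.
```

Pinning a request this way is handy for smoke-testing a new deployment while it still receives 0% of live traffic.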

Screenshot showing slider interface to set traffic allocation between deployments.

Traffic to one deployment can also be mirrored (copied) to another deployment. Mirroring traffic (also called shadowing) is useful when you want to test for things like response latency or error conditions without impacting live clients. For example, in a blue/green deployment, 100% of the traffic can be routed to blue while 10% is mirrored to green. With mirroring, the responses from the green deployment aren't returned to clients, but metrics and logs are collected. Testing a new deployment this way is also known as shadow testing. The functionality is currently a preview feature.
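The mirroring behavior can be sketched as a small simulation: every request is answered by the live deployment, and a sampled fraction is also copied to the shadow deployment, whose result is only logged. This is an illustration of the concept under assumed names (`serve_with_mirroring`, `blue`, `green`), not how the service implements it.

```python
import random


def serve_with_mirroring(payload, live, shadow, mirror_percent, rng, shadow_log):
    """Return the live deployment's response; copy a sample of requests to shadow."""
    if rng.random() * 100 < mirror_percent:
        try:
            shadow_log.append(("ok", shadow(payload)))
        except Exception as exc:  # shadow failures must never reach the client
            shadow_log.append(("error", repr(exc)))
    return live(payload)


def blue(payload):
    """Hypothetical live deployment."""
    return {"deployment": "blue", "score": sum(payload)}


def green(payload):
    """Hypothetical shadow deployment under test."""
    return {"deployment": "green", "score": sum(payload) + 1}


rng = random.Random(0)
shadow_log = []
responses = [
    serve_with_mirroring([i, i + 1], blue, green, 10, rng, shadow_log)
    for i in range(1_000)
]

# Clients only ever see blue; green's results are available for offline comparison.
print(len(responses), "client responses,", len(shadow_log), "mirrored requests logged")
```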

Diagram showing an endpoint mirroring traffic to a deployment.

Learn how to safely roll out to online endpoints.

Application Insights integration

All online endpoints integrate with Application Insights to monitor SLAs and diagnose issues.

However, managed online endpoints also include out-of-box integration with Azure Logs and Azure Metrics.

Security

  • Authentication: Key and Azure Machine Learning Tokens
  • Managed identity: User assigned and system assigned
  • SSL by default for endpoint invocation

Autoscaling

Autoscale automatically runs the right amount of resources to handle the load on your application. Managed endpoints support autoscaling through integration with the Azure Monitor autoscale feature. You can configure metrics-based scaling (for instance, CPU utilization >70%), schedule-based scaling (for example, scaling rules for peak business hours), or a combination.
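A metrics-based rule of this kind can be pictured as a small decision function. The sketch below is an illustration only, not the autoscale feature's actual algorithm: the 70% scale-out threshold comes from the example above, while the 30% scale-in threshold, the step size, and the function name are assumptions.

```python
def next_instance_count(current, cpu_percent, min_instances, max_instances,
                        scale_out_above=70, scale_in_below=30, step=1):
    """Apply one metrics-based scaling decision, clamped to [min, max] instances."""
    if cpu_percent > scale_out_above:
        desired = current + step      # under load: add capacity
    elif cpu_percent < scale_in_below:
        desired = current - step      # mostly idle: remove capacity
    else:
        desired = current             # within the comfortable band: no change
    return max(min_instances, min(max_instances, desired))


# Walk through a few hypothetical CPU readings, starting from 2 instances.
instances = 2
for cpu in [85, 90, 75, 50, 20, 10, 5]:
    instances = next_instance_count(instances, cpu, min_instances=1, max_instances=4)
    print(f"cpu={cpu}% -> {instances} instance(s)")
```

The clamp is what keeps the instance count between the configured minimum and maximum regardless of how long a rule keeps firing.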

Screenshot showing that autoscale flexibly provides between min and max instances, depending on rules.

Visual Studio Code debugging

Visual Studio Code enables you to interactively debug endpoints.

Screenshot of endpoint debugging in VS Code.

Private endpoint support

Optionally, you can secure communication with a managed online endpoint by using private endpoints.

You can configure security for inbound scoring requests and outbound communications with the workspace and other services separately. Inbound communications use the private endpoint of the Azure Machine Learning workspace. Outbound communications use private endpoints created per deployment.

For more information, see Secure online endpoints.

Managed online endpoints vs Kubernetes online endpoints

There are two types of online endpoints: managed online endpoints and Kubernetes online endpoints.

Managed online endpoints help to deploy your ML models in a turnkey manner. Managed online endpoints work with powerful CPU and GPU machines in Azure in a scalable, fully managed way. Managed online endpoints take care of serving, scaling, securing, and monitoring your models, freeing you from the overhead of setting up and managing the underlying infrastructure. The main example in this doc uses managed online endpoints for deployment.

Kubernetes online endpoints allow you to deploy models and serve online endpoints on your own fully configured and managed Kubernetes cluster anywhere, with CPUs or GPUs.

The following table highlights the key differences between managed online endpoints and Kubernetes online endpoints.

| | Managed online endpoints | Kubernetes online endpoints |
|---|---|---|
| Recommended users | Users who want a managed model deployment and enhanced MLOps experience | Users who prefer Kubernetes and can self-manage infrastructure requirements |
| Node provisioning | Managed compute provisioning, update, removal | User responsibility |
| Node maintenance | Managed host OS image updates, and security hardening | User responsibility |
| Cluster sizing (scaling) | Managed manual and autoscale, supporting additional nodes provisioning | Manual and autoscale, supporting scaling the number of replicas within fixed cluster boundaries |
| Compute type | Managed by the service | Customer-managed Kubernetes cluster (Kubernetes) |
| Managed identity | Supported | Supported |
| Virtual Network (VNET) | Supported via managed network isolation | User responsibility |
| Out-of-box monitoring & logging | Azure Monitor and Log Analytics powered (includes key metrics and log tables for endpoints and deployments) | User responsibility |
| Logging with Application Insights (legacy) | Supported | Supported |
| View costs | Detailed to endpoint / deployment level | Cluster level |
| Cost applied to | VMs assigned to the deployments | VMs assigned to the cluster |
| Mirrored traffic | Supported (preview) | Unsupported |
| No-code deployment | Supported (MLflow and Triton models) | Supported (MLflow and Triton models) |

Managed online endpoints

Managed online endpoints can help streamline your deployment process. Managed online endpoints provide the following benefits over Kubernetes online endpoints:

  • Managed infrastructure

    • Automatically provisions the compute and hosts the model (you just need to specify the VM type and scale settings)
    • Automatically updates and patches the underlying host OS image
    • Automatically recovers nodes if there's a system failure
  • Monitoring and logs

    Screenshot showing Azure Monitor graph of endpoint latency.

  • View costs

    Screenshot cost chart of an endpoint and deployment.

    Note

    Managed online endpoints are based on Azure Machine Learning compute. When using a managed online endpoint, you pay for the compute and networking charges. There is no additional surcharge.

    If you use a virtual network and secure outbound (egress) traffic from the managed online endpoint, there is an additional cost. For egress, three private endpoints are created per deployment for the managed online endpoint. These are used to communicate with the default storage account, Azure Container Registry, and workspace. Additional networking charges may apply. For more information on pricing, see the Azure pricing calculator.

For a step-by-step tutorial, see How to deploy online endpoints.

Next steps