Model Deployment into Production: Strategies

Our primary deployment artifact in this study is a model (in general, a binary file or a set of files) together with the related inference (scoring) code (a minimal sketch of such a scoring script appears after the list below). We make the following assumptions:

  • A model has been trained in the Azure Machine Learning service and is available in the model repository of the development environment.
  • Several models may need to be used in the inference pipeline; therefore, no limits are placed on the execution time of the inference workload.
  • The production environment will likely differ from the development environment, and some services may not be available in the development environment.
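
To make the scoring-code artifact concrete, the sketch below shows a minimal Azure ML-style scoring script using the standard init()/run() contract. The model file name (model.pkl), the use of scikit-learn/joblib, and the request schema are assumptions for illustration only.

```python
# score.py -- minimal scoring script sketch for an Azure ML deployment.
# Assumptions: the model is a scikit-learn estimator serialized with joblib
# as "model.pkl", and requests send JSON of the form {"data": [[...], ...]}.
import json
import os

import joblib

model = None


def init():
    # Called once when the serving container starts; load the model into memory.
    global model
    # AZUREML_MODEL_DIR points to the folder where the registered model files are mounted.
    model_path = os.path.join(os.environ["AZUREML_MODEL_DIR"], "model.pkl")
    model = joblib.load(model_path)


def run(raw_data):
    # Called for every scoring request; return a JSON-serializable result.
    payload = json.loads(raw_data)
    predictions = model.predict(payload["data"])
    return predictions.tolist()
```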

Different solutions are often used for training and for hosting AI models. For example, models trained in Azure ML might not be hosted in Azure ML for inferencing.

All our inferencing workloads can be divided into three scoring groups: real-time, near real-time, and batch. The following table summarizes, for each group, the options that work with an Azure ML-managed deployment approach and with a custom deployment approach.

| Scoring type | Managed by Azure ML | Custom deployment |
| --- | --- | --- |
| Real-time scoring | Azure Kubernetes Service (AKS)/Arc-enabled Kubernetes, Azure ML Online Endpoints | Azure Functions, Azure App Service, Azure Container Instances (ACI), unmanaged Kubernetes, Azure Container Apps, IoT Edge |
| Near real-time scoring | N/A | Azure Durable Functions; AKS with KEDA, a queue service, Triton, or the Azure Functions runtime |
| Batch scoring | Azure ML Pipelines, Batch Scoring Endpoints | Batch scoring using Databricks |
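
As an example of the custom real-time row, a small model can be hosted behind an HTTP-triggered Azure Function. The sketch below uses the Azure Functions Python v2 programming model; the model file, input schema, and route name are assumptions.

```python
# function_app.py -- sketch of real-time scoring with an HTTP-triggered Azure
# Function (Python v2 programming model). Model path and schema are illustrative.
import json

import azure.functions as func
import joblib

app = func.FunctionApp()

# Load the (small) model once per worker process, not per request.
model = joblib.load("model.pkl")  # assumption: the model ships with the function package


@app.route(route="score", auth_level=func.AuthLevel.FUNCTION)
def score(req: func.HttpRequest) -> func.HttpResponse:
    try:
        payload = req.get_json()  # expected shape: {"data": [[...feature values...]]}
    except ValueError:
        return func.HttpResponse("Request body must be JSON", status_code=400)

    predictions = model.predict(payload["data"]).tolist()
    return func.HttpResponse(
        json.dumps({"predictions": predictions}),
        mimetype="application/json",
    )
```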

Other options, such as Kubeflow or even Azure Durable Functions, are possible for batch scoring, but they are less common and come with more limitations and complexity.

Deployment options

Different options are available, each with its own advantages; they are compared by scoring type above and summarized below.

Production vs. development subscriptions

It is common for development and training to take place in one subscription while the inference service is deployed into production in another subscription. NOTE: Batch Endpoints and Online Endpoints cannot be deployed outside the subscription where the Azure ML workspace is located. Therefore, if you want to use Online Endpoints, you need to deploy a separate Azure ML workspace in the production subscription, copy your model into that workspace during the deployment, and execute the deployment from there. A separate Azure ML workspace is required for Azure ML Pipelines as well.
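
The snippet below is a sketch of that cross-subscription flow with the Azure ML Python SDK v2 (azure-ai-ml): it downloads a registered model from the development workspace, re-registers it in the production workspace, and deploys it to a managed online endpoint. All subscription IDs, resource groups, workspace and endpoint names, the curated environment, and the instance size are placeholders/assumptions.

```python
# Sketch: copy a model from the dev workspace into a production workspace and
# deploy it to a managed online endpoint (Azure ML Python SDK v2). All names,
# IDs, the environment, and the VM size below are placeholders.
from azure.ai.ml import MLClient
from azure.ai.ml.entities import (
    CodeConfiguration,
    ManagedOnlineDeployment,
    ManagedOnlineEndpoint,
    Model,
)
from azure.identity import DefaultAzureCredential

cred = DefaultAzureCredential()
dev = MLClient(cred, "<dev-subscription-id>", "<dev-resource-group>", "<dev-workspace>")
prod = MLClient(cred, "<prod-subscription-id>", "<prod-resource-group>", "<prod-workspace>")

# 1. Pull the trained model out of the development workspace ...
dev.models.download(name="my-model", version="1", download_path="./model_copy")

# 2. ... and register it in the production workspace.
prod_model = prod.models.create_or_update(
    Model(name="my-model", version="1", path="./model_copy/my-model")
)

# 3. Create the endpoint and a deployment in the production workspace.
endpoint = ManagedOnlineEndpoint(name="my-scoring-endpoint", auth_mode="key")
prod.online_endpoints.begin_create_or_update(endpoint).result()

deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="my-scoring-endpoint",
    model=prod_model,
    code_configuration=CodeConfiguration(code="./src", scoring_script="score.py"),
    environment="azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",  # assumed curated env
    instance_type="Standard_DS3_v2",
    instance_count=1,
)
prod.online_deployments.begin_create_or_update(deployment).result()
```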

Deployment summary

| Option | Batch inferencing | Near real-time | Real-time |
| --- | --- | --- | --- |
| Azure ML Pipelines | Recommended approach for any workload | N/A | N/A |
| Batch online endpoints | Single-step inferencing workloads | N/A | N/A |
| Databricks | Custom workloads integrated with MLflow and other Spark features | N/A | N/A |
| Managed AKS | N/A | N/A | Effective when the customer wants to manage and control the infrastructure; AKS can be in a different subscription |
| Azure Container Instances | N/A | N/A | Efficient way to deploy approved, tested, and versioned images/environments, regardless of which subscriptions are used for training and deployment |
| Online endpoints | N/A | N/A | Recommended way for real-time scoring; infrastructure is managed by Azure ML; simple deployment process with scaling based on Azure ML compute |
| Azure Functions | N/A | N/A | Simple workloads and small models; Azure ML is not required |
| Azure Durable Functions | N/A | CPU-based workloads, small models | N/A |
| Unmanaged AKS with KEDA | N/A | Custom workloads of any complexity, but deep technical knowledge is required | Can be used for custom images when Azure ML is not available |
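
To make the batch column concrete, the sketch below submits a scoring job against an existing batch endpoint with the Azure ML Python SDK v2. The endpoint name and input data path are assumptions, and the exact invoke parameters can vary slightly between SDK versions.

```python
# Sketch: trigger a batch scoring job against an existing batch endpoint
# (Azure ML Python SDK v2). Endpoint name and data path are placeholders.
from azure.ai.ml import Input, MLClient
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    "<subscription-id>",
    "<resource-group>",
    "<workspace-name>",
)

# Point the job at a folder of input files reachable from the workspace.
job = ml_client.batch_endpoints.invoke(
    endpoint_name="my-batch-endpoint",
    input=Input(
        type=AssetTypes.URI_FOLDER,
        path="azureml://datastores/workspaceblobstore/paths/scoring-input/",
    ),
)

# Block until the scoring job finishes and stream its logs.
ml_client.jobs.stream(job.name)
```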

For more information

More general content outlining the different Azure compute options available can be found here: