Strategies for model deployment into production
Our primary deployment artifact in this study is a model (in general, a binary file or a set of files) together with the related inference (scoring) code. We make the following assumptions:
- A model has been trained in the Azure Machine Learning (Azure ML) service and is available in the model repository in the development environment.
- Several models may need to be chained in the inference pipeline; therefore, no limits are placed on the execution time of the inference workload.
- The production environment will likely differ from the development environment, and some services may not be available in the development environment.
Technology matrix
Different solutions are often used for training and for hosting AI models. For example, a model trained on Azure ML does not have to run on Azure ML for inferencing. Our inferencing workloads can be divided into three separate scoring groups: real-time, near real-time, and batch. The table below summarizes, for each group, the options that work with an Azure ML-managed deployment approach and with a custom deployment approach.
Scoring type | Managed by Azure ML | Custom deployment |
---|---|---|
Real-time scoring | Azure Kubernetes Service (AKS)/Arc-enabled Kubernetes, Azure ML Online Endpoints | Azure Functions, Azure App Service, Azure Container Instances (ACI), unmanaged Kubernetes/Azure Container Apps/IoT Edge |
Near real-time scoring | N/A | Azure Durable Functions, AKS with KEDA/a queue service/Triton/the Azure Functions runtime |
Batch scoring | Azure ML Pipelines, Batch Endpoints | Batch scoring using Databricks. Options such as Kubeflow or even Durable Azure Functions are possible, but they are uncommon for batch scoring and bring more limitations and complexity |
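As an illustration of the Batch Endpoints row above, the following is a minimal sketch of submitting a batch scoring job with the Azure ML Python SDK v2. The subscription, workspace, endpoint name, and input path are all placeholders, and the exact parameter names of `invoke` can vary between SDK versions.

```python
from azure.ai.ml import MLClient, Input
from azure.identity import DefaultAzureCredential

# Connect to the workspace that hosts the batch endpoint.
# Subscription, resource group, and workspace names are placeholders.
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Submit a scoring job against an existing batch endpoint
# ("my-batch-endpoint" is a hypothetical name).
job = ml_client.batch_endpoints.invoke(
    endpoint_name="my-batch-endpoint",
    input=Input(
        type="uri_folder",
        path="azureml://datastores/workspaceblobstore/paths/input-data/",
    ),
)
print(f"Submitted batch scoring job: {job.name}")
```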
Deployment options
Several deployment options are available, each with its own advantages; they are compared in the summary table below.
Production vs. development subscriptions
It is common for development and training to take place in one subscription while the inferencing service is deployed into production in another subscription. NOTE: Batch Endpoints and Online Endpoints cannot be deployed outside the subscription where the Azure ML workspace is located. Therefore, if you want to use Online Endpoints in production, you need to deploy a separate Azure ML workspace there, copy your model into that workspace during deployment, and execute the deployment from it. A separate Azure ML workspace is required for Azure ML Pipelines as well.
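To illustrate the model-copy step, here is a minimal sketch using the Azure ML Python SDK v2 with two `MLClient` instances, one per subscription. All identifiers (workspaces, resource groups, model name and version, local path) are placeholders, and the folder layout produced by `models.download` may differ between SDK versions.

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Model
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

# One client per subscription/workspace; all identifiers are placeholders.
dev_client = MLClient(credential, "<dev-subscription-id>", "<dev-resource-group>", "<dev-workspace>")
prod_client = MLClient(credential, "<prod-subscription-id>", "<prod-resource-group>", "<prod-workspace>")

# Download the trained model from the development workspace to a local folder.
dev_client.models.download(name="my-model", version="1", download_path="./model_copy")

# Register the downloaded copy in the production workspace so that
# Online Endpoints deployed there can reference it.
# Note: models.download typically nests the artifacts under a folder named
# after the model; adjust the path below to match the actual layout.
prod_model = prod_client.models.create_or_update(
    Model(name="my-model", path="./model_copy/my-model", type="custom_model")
)
print(f"Registered {prod_model.name}:{prod_model.version} in the production workspace")
```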
Deployment summary
Deployment option | Batch inferencing | Near real-time | Real-time |
---|---|---|---|
Azure ML Pipelines | Recommended approach for any workload | N/A | N/A |
Batch Endpoints | Single-step inferencing workloads | N/A | N/A |
Databricks | Custom workloads integrated with MLflow and other Spark features | N/A | N/A |
Managed AKS | N/A | N/A | Effective when the customer wants to manage and control the infrastructure. AKS can be in a different subscription |
Azure Container Instances | N/A | N/A | Efficient way to deploy approved, tested, and versioned images/environments regardless of the training and deployment subscriptions |
Online Endpoints | N/A | N/A | Recommended way for real-time scoring. Infrastructure is managed by Azure ML. Simple deployment process with scaling based on Azure ML Compute |
Azure Functions | N/A | N/A | Simple workloads and small models; Azure ML is not required |
Azure Durable Functions | N/A | CPU-based workloads, small models | N/A |
Unmanaged AKS with KEDA | N/A | Custom workloads of any complexity, but deep technical knowledge is required | Can be used for custom images when Azure ML is not available |
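To illustrate the recommended real-time path from the table above, the sketch below creates a managed Online Endpoint and a deployment with the Azure ML Python SDK v2. It assumes an MLflow-format model already registered in the workspace (no scoring script is needed in that case); the endpoint, deployment, and model names plus the VM size are placeholders.

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    "<subscription-id>", "<resource-group>", "<workspace-name>",  # placeholders
)

# Create the endpoint: a stable scoring URI with key-based auth.
endpoint = ManagedOnlineEndpoint(name="my-scoring-endpoint", auth_mode="key")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# Create a deployment behind the endpoint; Azure ML provisions and manages
# the compute. "azureml:my-model:1" assumes an MLflow-format model already
# registered in this workspace (no scoring script required in that case).
deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="my-scoring-endpoint",
    model="azureml:my-model:1",
    instance_type="Standard_DS3_v2",
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(deployment).result()

# Route all traffic to the new deployment.
endpoint.traffic = {"blue": 100}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
```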
For more information
More general content outlining the different Azure compute options available can be found here: