Strategies for model deployment into production
Our primary deployment artifact in this study is a model (in general, a binary file or a set of files) together with the related inference (scoring) code. We make the following assumptions:
- A model has been trained in the Azure Machine Learning (Azure ML) service and is available in the model repository in the development environment.
- Several models may need to be chained in the inference pipeline; therefore, no limits are placed on the execution time of the inference workload.
- The production environment will likely differ from the development environment, and some services may not be available in the development environment.
Technology matrix
Different solutions are often used for training and for hosting AI models. For example, a model trained on Azure ML does not have to run on Azure ML for inferencing. Our inferencing workloads can be divided into three separate scoring groups: real-time, near real-time, and batch. The table below summarizes, for each group, the options that work with an Azure ML-managed deployment approach and with a custom deployment approach.
Scoring type | Managed by Azure ML | Custom deployment |
---|---|---|
Real-time scoring | Azure Kubernetes Service (AKS)/Arc-enabled Kubernetes, Azure ML Online Endpoints | Azure Functions, Azure App Service, Azure Container Instances (ACI), unmanaged Kubernetes/Azure Container Apps/IoT Edge |
Near real-time scoring | N/A | Azure Durable Functions, AKS with KEDA/a queue service/Triton/the Azure Functions runtime |
Batch scoring | Azure ML Pipelines, Batch Endpoints | Batch scoring using Databricks. Options such as Kubeflow or even Durable Azure Functions are possible, but they are uncommon for batch scoring and bring more limitations and complexity |
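As an illustration of the Batch Endpoints row above, the following is a minimal sketch of submitting a batch scoring job with the Azure ML Python SDK v2. The subscription, workspace, endpoint name, and input path are all placeholders, and the exact parameter names of `invoke` can vary between SDK versions.

```python
from azure.ai.ml import MLClient, Input
from azure.identity import DefaultAzureCredential

# Connect to the workspace that hosts the batch endpoint.
# Subscription, resource group, and workspace names are placeholders.
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Submit a scoring job against an existing batch endpoint
# ("my-batch-endpoint" is a hypothetical name).
job = ml_client.batch_endpoints.invoke(
    endpoint_name="my-batch-endpoint",
    input=Input(
        type="uri_folder",
        path="azureml://datastores/workspaceblobstore/paths/input-data/",
    ),
)
print(f"Submitted batch scoring job: {job.name}")
```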
Deployment options
Several deployment options are available, each with its own advantages; they are compared in the summary table below.
Production vs. development subscriptions
It is common for development and training to take place in one subscription while the inferencing service is deployed into production in another subscription. NOTE: Batch Endpoints and Online Endpoints cannot be deployed outside the subscription where the Azure ML workspace is located. Therefore, if you want to use Online Endpoints in production, you need to deploy a separate Azure ML workspace there, copy your model into that workspace during deployment, and execute the deployment from it. A separate Azure ML workspace is required for Azure ML Pipelines as well.
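To illustrate the model-copy step, here is a minimal sketch using the Azure ML Python SDK v2 with two `MLClient` instances, one per subscription. All identifiers (workspaces, resource groups, model name and version, local path) are placeholders, and the folder layout produced by `models.download` may differ between SDK versions.

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Model
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()

# One client per subscription/workspace; all identifiers are placeholders.
dev_client = MLClient(credential, "<dev-subscription-id>", "<dev-resource-group>", "<dev-workspace>")
prod_client = MLClient(credential, "<prod-subscription-id>", "<prod-resource-group>", "<prod-workspace>")

# Download the trained model from the development workspace to a local folder.
dev_client.models.download(name="my-model", version="1", download_path="./model_copy")

# Register the downloaded copy in the production workspace so that
# Online Endpoints deployed there can reference it.
# Note: models.download typically nests the artifacts under a folder named
# after the model; adjust the path below to match the actual layout.
prod_model = prod_client.models.create_or_update(
    Model(name="my-model", path="./model_copy/my-model", type="custom_model")
)
print(f"Registered {prod_model.name}:{prod_model.version} in the production workspace")
```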
Deployment summary
Deployment option | Batch inferencing | Near real-time | Real-time |
---|---|---|---|
Azure ML Pipelines | Recommended approach for any workload | N/A | N/A |
Batch Endpoints | Single-step inferencing workloads | N/A | N/A |
Databricks | Custom workloads integrated with MLflow and other Spark features | N/A | N/A |
Managed AKS | N/A | N/A | Effective when the customer wants to manage and control the infrastructure. AKS can be in a different subscription |
Azure Container Instances | N/A | N/A | Efficient way to deploy approved, tested, and versioned images/environments regardless of the training and deployment subscriptions |
Online Endpoints | N/A | N/A | Recommended way for real-time scoring. Infrastructure is managed by Azure ML. Simple deployment process with scaling based on Azure ML Compute |
Azure Functions | N/A | N/A | Simple workloads and small models; Azure ML is not required |
Azure Durable Functions | N/A | CPU-based workloads, small models | N/A |
Unmanaged AKS with KEDA | N/A | Custom workloads of any complexity, but deep technical knowledge is required | Can be used for custom images when Azure ML is not available |
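To illustrate the recommended real-time path from the table above, the sketch below creates a managed Online Endpoint and a deployment with the Azure ML Python SDK v2. It assumes an MLflow-format model already registered in the workspace (no scoring script is needed in that case); the endpoint, deployment, and model names plus the VM size are placeholders.

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    "<subscription-id>", "<resource-group>", "<workspace-name>",  # placeholders
)

# Create the endpoint: a stable scoring URI with key-based auth.
endpoint = ManagedOnlineEndpoint(name="my-scoring-endpoint", auth_mode="key")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# Create a deployment behind the endpoint; Azure ML provisions and manages
# the compute. "azureml:my-model:1" assumes an MLflow-format model already
# registered in this workspace (no scoring script required in that case).
deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="my-scoring-endpoint",
    model="azureml:my-model:1",
    instance_type="Standard_DS3_v2",
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(deployment).result()

# Route all traffic to the new deployment.
endpoint.traffic = {"blue": 100}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
```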
For more information
More general content outlining the different Azure compute options available can be found here: