Automating and monitoring ML model development

Article
01/13/2025

Our customers build ML capabilities to improve operational efficiency and automation. When scaling these capabilities across an enterprise, developers face the most friction. This solution helps software engineers and data scientists create repeatable ML experiments, reusable code, and scalable deployments of their ML models.

Business problem

In the rapidly evolving field of machine learning, businesses face significant challenges in efficiently developing, deploying, and scaling ML models. The complexity of managing infrastructure and the operationalization of models can lead to prolonged development cycles, increased costs, and missed market opportunities. Moreover, manual processes for provisioning infrastructure and deploying models are prone to errors, lack consistency, and struggle to keep pace with the demands of modern ML applications. These challenges hinder an organization's ability to innovate and apply machine learning effectively, ultimately impacting competitive advantage and return on investment in AI technologies.

Traditional software development is not well suited to the unique requirements of machine learning models. Often, machine learning models are built in a research environment and then manually moved to production, where they need to be maintained. So, manual deployment:

Leads to inefficiencies in deployment and management.
Limits reproducibility and transparency of ML models.
Inhibits collaboration between data science and operations teams.
Reduces the ability to monitor and detect drift in ML models for our customers.

Solution overview

The broader practice of MLOps provides a framework for managing the complete lifecycle of machine learning models, from development to deployment to monitoring and maintenance. This solution uses a combination of CI/CD pipelines, ML pipelines, and cloud-based orchestration to train, retrain, and monitor machine learning models. Alongside these tools, it is important to develop a process for managing the ML lifecycle, as these tools cannot fully enforce proper ML development and operations on their own. Organizations use these tools to build accurate, scalable, and secure ML solutions that deliver business value over time.

Value proposition

For data scientists and data engineers, this solution enables:

Transparent and reproducible ML model training.
Automated model retraining and deployment to reduce human error and inefficiency.
A simplified path from experimentation to deployment of a new model.
Scaled impact of Data Scientists who define and develop ML models.
Standards for governance and security of ML models.

Logical architecture

Mermaid diagram #1

Solution building blocks

This solution is based on building real MLOps solutions and incorporates capabilities.

Stage	Capability	Description
Experimentation	Experimentation in MLOps	Learn how to manage Jupyter Notebooks and Model experimentation in a MLOps framework
	Experimentation in Azure ML	Review an example implementation of a code repo for ML development using Azure ML
Model Development	ML Pipelines	Build a Pipeline to train a model
Deployment	Release Pipelines	Integrate the ML pipeline into a CI/CD pipeline
	Sync - Async MLOps Pattern	Understand when to use async jobs in AzDo for model training
	Model Release	Learn about the various model release options
	Model Deployment in Azure ML	How to deploy a model using Azure ML
	Model Flighting	Use managed online endpoints to flight versions of the model for deployment
ML Lifecycle Management	Drift Monitoring	Understand the basics of Data Drift monitoring and how to implement it

Implementations

This example solution has been implemented in several GitHub code repos.

MLOps Model Factory / DSToolkit-V2

Link to template

Model Factory uses the capabilities of Azure Machine Learning and the operational efficiency of Azure DevOps or GitHub to transform the machine learning lifecycle. It does this transformation by automating the process of infrastructure provisioning and model deployment. These measures ensure that machine learning models are developed, tested and deployed more rapidly, reliably and at scale. This integration:

Reduces the time-to-market for innovative ML-driven applications and services
Enhances the agility of ML teams
Maximizes the return on investment in AI initiatives.

The Model Factory approach to automation utilizes either Bicep or Terraform for infrastructure as code (IaC) and supports both batch and online model endpoints. This automation not only streamlines development workflows but also significantly reduces the risk of human error, ensures compliance, and optimizes operational costs.

Model Factory provides help with:

Enabling transparent and reproducible ML model training for many models.
Automating model retraining and deployment to reduce human error and inefficiency.
Simplifying the path from experimentation to deployment of a new model.
Scaling Model experimentation for Data Scientists who define and develop ML models.
Applying standards for governance and security of ML models.
Automated benchmarking of models with comparison against past model performance.
Automated AML asset management and asset cleanup.
Secure infrastructure IaC deployment samples.
Traceability
- Dataset traceability in AML, linking datasets to the jobs which use them.
- Linking models registered in AML to the jobs which created them.
- Linking child jobs to their parents.
Observability
- MLFlow based model performance and other metrics in AML.
- Model tagging (Production / development tags etc.).

Model Factory provides automation of the following items:

Infrastructure provisioning using either Azure Pipelines or GitHub workflows using either Bicep or Terraform as the IaC (Infrastructure as Code) language.
A CI build triggered upon changes to one or more models.
A CD build and deployment of one or more models to batch and online endpoints.

Core MLOps templates (Azure ML)

These two templates provide the code structure necessary to create a production-level automated model training pipeline. They use Azure Machine Learning and Azure Pipelines (or GitHub Actions) as the core services. Both provide example pipelines, and a folder structure suited to most ML tasks.

MLOps Template for Azure ML CLI v2
Jenkins Implementation: this folder contains files necessary to implement the CLI template, using Jenkins for CI/CD
MLOps Template for Azure ML SDK v2

The SDK (Software Development Kit) based template and CLI (Command-Line Interface) based template are two different approaches to using MLOps templates.

Here are some differences of an SDK-based MLOps template compared to a CLI-based template:

Customization: An SDK-based template allows for greater customization since it provides access to underlying code and allows engineers to modify it to suit their specific needs. This flexibility is useful when creating complex applications requiring advanced features or functionality. For example, the SDK allows conditional steps in a pipeline, and making querying and filtering pipeline results easier.
Integration: An SDK-based template is easier to integrate into other applications, frameworks, and systems since it is built using standard programming languages and libraries. These features make it easier to incorporate into larger projects or environments.
Platform independence: An SDK-based template can be used on multiple platforms, such as Windows, Linux, and macOS, without requiring any modifications. An SDK is designed to be cross-platform and used across different operating systems.
Simplicity: CLI-based templates are typically easier to use since they only require engineers to enter commands into the command-line interface. Developers with less programming experience may find this method more accessible.
Fast prototyping: CLI-based templates can be used for quick prototyping of ideas and do not need complex programming. Proof-of-concept applications or exploring new ideas can benefit from the quick turnaround.
Automation: CLI-based templates can be automated to perform repetitive tasks quickly, such as building and deploying applications. This automation saves engineer's significant amounts of time and effort in the development process.
Non-Python development: Both SDK and CLI-based MLOps templates are built on top of Azure Machine Learning provider REST API. Details of these REST APIs are available at AML REST API. CLI-based templates or a direct REST API should be used for non-python based development.

Some more comparison details are available in the official documentation SDK VS CLI.

The choice between an SDK-based or CLI-based template depends on the needs of the project and the expertise of the engineers involved. Both approaches have their own strengths and engineers should evaluate each option carefully before deciding.

Databricks templates

These templates provide similar capabilities as the Azure ML templates, but use Databricks as the main compute interface for training the machine learning models:

Databricks MLOps Template based on Workflows and Repos

For more information

If it is your first time exploring MLOps, the bulk of this information is meant to explain and explore MLOps; consider reviewing more of the capabilities and guidance before implementing this solution. Good places to start include:

Understanding MLOps Capabilities.
Get started with Azure ML is an overview of the Azure ML service, which is crucial to several implementations.

Additional resources

Documentation

AI Experimentation

In the experimentation phase, data scientists and ML engineers collaborate. They work on exploratory data analysis, prototyping ML approaches, feature engineering,and testing hypothesis.
AI Solutions

A solution is an opinionated engineering approach that brings together a set of capabilities to solve a business problem. It provides guidance, insights, best practices on how to develop a complete functional solution to address an end-to-end business scenario along with code. All solutions listed have been successfully applied and validated by multiple customers.
AI Lifecycle Management

The MLOps process isn't a linear, one-time operation. The AI Lifecycle encapsulates the cyclical nature of this process. Various components also take place across multiple stages of the MLOps process, such as monitoring, data collection, or retraining.
LLMOps with Prompt flow

LLMOps with Prompt flow is a complete platform for developing multiple flows related to LLM applications.
Using experimentation to accelerate RAG development

The RAG Experiment Accelerator is a tool designed to help teams quickly find the best strategies for RAG implementation by running multiple experiments and evaluating the results. It provides a standardized and consistent way to experiment with RAG. It also provides tools for configuring, indexing, querying, and evaluating using RAG.
AI/ML Capabilities Map

The AI/ML Capabilities Map shows the main capabilities within a typical AI/ML lifecycle.
AI Drift

Drift broadly refers to changes in data and the concepts and mechanisms that they represent over time.
Experimentation tools and setup in Azure ML

Recommended core suite of tools for any MLOps engagement

Training

Learning path

Introduction to machine learning operations (MLOps) - Training

Introduction to machine learning operations (MLOps)

Certification

Microsoft Certified: Azure Data Scientist Associate - Certifications

Manage data ingestion and preparation, model training and deployment, and machine learning solution monitoring with Python, Azure Machine Learning and MLflow.

Share via