Automating and monitoring ML model development

Our customers build ML capabilities to improve operational efficiency and automation. Developers face the most friction when scaling these capabilities across an enterprise. This solution helps software engineers and data scientists create repeatable ML experiments, reusable code, and scalable deployments of their ML models.

Business problem

Traditional software development practices are not well suited to the unique requirements of machine learning models. Machine learning models are often built in a research environment and then manually moved to production, where they must be maintained. This manual deployment:

  • Leads to inefficiencies in deployment and management.
  • Limits reproducibility and transparency of ML models.
  • Inhibits collaboration between data science and operations teams.
  • Reduces the ability to monitor and detect drift in ML models for our customers.

Solution overview

The broader practice of MLOps provides a framework for managing the complete lifecycle of machine learning models, from development to deployment to monitoring and maintenance. This solution uses a combination of CI/CD pipelines, ML pipelines, and cloud-based orchestration to train, retrain, and monitor machine learning models. Alongside these tools, it is important to develop a process for managing the ML lifecycle, as these tools cannot fully enforce proper ML development and operations on their own. Organizations use these tools to build accurate, scalable, and secure ML solutions that deliver business value over time.

Value proposition

For data scientists and data engineers, this solution enables:

  • Transparent and reproducible ML model training.
  • Automated model retraining and deployment to reduce human error and inefficiency.
  • A simplified path from experimentation to deployment of a new model.
  • Greater impact for the data scientists who define and develop ML models.
  • Standards for governance and security of ML models.

Logical Architecture

Mermaid diagram #1

Solution building blocks

This solution is based on experience building real MLOps solutions and incorporates the following capabilities:

| Stage | Capability | Description |
| --- | --- | --- |
| Experimentation | Experimentation in MLOps | Learn how to manage Jupyter Notebooks and model experimentation in an MLOps framework |
| Experimentation | Experimentation in Azure ML | Review an example implementation of a code repo for ML development using Azure ML |
| Model Development | ML Pipelines | Build a pipeline to train a model |
| Model Development | ML Testing Scenarios | Create unit tests for your ML model |
| Deployment | Release Pipelines | Integrate the ML pipeline into a CI/CD pipeline |
| Deployment | Sync - Async MLOps Pattern | Understand when to use async jobs in Azure DevOps for model training |
| Deployment | Model Release | Learn about the various model release options |
| Deployment | Model Deployment in Azure ML | How to deploy a model using Azure ML |
| Deployment | Model Flighting | Use managed online endpoints to flight versions of the model for deployment |
| ML Lifecycle Management | Drift Monitoring | Understand the basics of data drift monitoring and how to implement it |
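To make the drift monitoring capability concrete, here is a small, framework-free sketch that compares a baseline feature distribution against current production data using the Population Stability Index (PSI). The function and the 0.1 / 0.25 thresholds are common rules of thumb for illustration only, not part of this solution's actual implementation.

```python
import math
import random

def population_stability_index(baseline, current, bins=10):
    """Bin the baseline into quantiles and measure how the current
    sample's distribution shifts across those bins (PSI)."""
    baseline_sorted = sorted(baseline)
    # Quantile edges derived from the baseline sample
    edges = [baseline_sorted[int(i * (len(baseline_sorted) - 1) / bins)]
             for i in range(1, bins)]

    def bin_fractions(sample):
        counts = [0] * bins
        for x in sample:
            idx = sum(1 for e in edges if x > e)  # which bin x falls into
            counts[idx] += 1
        # Small epsilon avoids log(0) for empty bins
        return [(c + 1e-6) / (len(sample) + bins * 1e-6) for c in counts]

    b, c = bin_fractions(baseline), bin_fractions(current)
    return sum((cf - bf) * math.log(cf / bf) for bf, cf in zip(b, c))

random.seed(0)
baseline = [random.gauss(0, 1) for _ in range(5000)]   # training-time data
same     = [random.gauss(0, 1) for _ in range(5000)]   # stable production data
shifted  = [random.gauss(1.0, 1) for _ in range(5000)] # drifted production data

print(population_stability_index(baseline, same) < 0.1)      # no drift
print(population_stability_index(baseline, shifted) > 0.25)  # drift detected
```

In a production pipeline, a check like this would run on a schedule and trigger the automated retraining described above when the score crosses a threshold.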

Implementations

This example solution has been implemented in several GitHub code repos.

Core MLOps Templates (Azure ML)

These two templates provide the code structure necessary to create a production-level automated model training pipeline. They use Azure Machine Learning and Azure Pipelines (or GitHub Actions) as the core services. Both provide example pipelines, and a folder structure suited to most ML tasks.
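To illustrate the kind of training pipeline these templates automate, here is a minimal sketch of the prepare → train → evaluate → register stages using only the standard library. The function names and the quality gate are illustrative, not the templates' actual API.

```python
import json
import statistics

def prepare_data():
    # Toy dataset: y = 2x + 1, noise-free for a deterministic example
    xs = list(range(10))
    ys = [2 * x + 1 for x in xs]
    return xs, ys

def train(xs, ys):
    # Ordinary least squares for a single feature, closed form
    mx, my = statistics.mean(xs), statistics.mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return {"slope": slope, "intercept": my - slope * mx}

def evaluate(model, xs, ys):
    # Mean squared error on the held-out data
    preds = [model["slope"] * x + model["intercept"] for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

def register(model, mse, threshold=1e-6):
    # A real template would push to a model registry; here we just gate on MSE
    if mse <= threshold:
        return json.dumps(model)
    raise ValueError("model did not meet the quality gate")

xs, ys = prepare_data()
model = train(xs, ys)
print(register(model, evaluate(model, xs, ys)))
```

The templates wire equivalent stages together as pipeline steps in Azure Machine Learning, so each stage runs on managed compute and its outputs are tracked.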

The SDK (Software Development Kit) based template and CLI (Command Line Interface) based template are two different approaches to using MLOps templates.

Here are some differences between an SDK-based MLOps template and a CLI-based template:

  • Customization: An SDK-based template allows for greater customization since it provides access to underlying code and allows engineers to modify it to suit their specific needs. This flexibility is useful when creating complex applications requiring advanced features or functionality. For example, the SDK allows conditional steps in a pipeline, as well as making querying and filtering pipeline results easier.

  • Integration: An SDK-based template is easier to integrate into other applications, frameworks, and systems since it is built using standard programming languages and libraries. These features make it easier to incorporate into larger projects or environments.

  • Platform independence: An SDK-based template can be used on multiple platforms, such as Windows, Linux, and macOS, without modification, since the SDK is designed to be cross-platform.

  • Simplicity: CLI-based templates are typically easier to use since they only require engineers to enter commands into the command-line interface. Developers with less programming experience may find this method more accessible.

  • Fast prototyping: CLI-based templates can be used for quick prototyping of ideas and do not need complex programming. Proof-of-concept applications or exploring new ideas can benefit from the quick turnaround.

  • Automation: CLI-based templates can be automated to perform repetitive tasks quickly, such as building and deploying applications. This automation saves engineers significant amounts of time and effort in the development process.

  • Non-Python development: Both SDK and CLI-based MLOps templates are built on top of the Azure Machine Learning REST API. Details of these REST APIs are available at AML REST API. CLI-based templates or the REST API directly should be used for non-Python development.

Additional comparison details are available in the official documentation: SDK vs CLI.

The choice between an SDK-based or CLI-based template depends on the needs of the project and the expertise of the engineers involved. Both approaches have their own strengths and engineers should evaluate each option carefully before deciding.
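The SDK's advantage in querying and filtering pipeline results, mentioned above, can be sketched with plain Python. The job records below are hypothetical stand-ins for what an SDK list-jobs call returns; the field names are illustrative, not Azure ML's actual schema.

```python
# Hypothetical job records; in practice these would come from an SDK call
jobs = [
    {"name": "train-01", "status": "Completed", "metrics": {"auc": 0.91}},
    {"name": "train-02", "status": "Failed",    "metrics": {}},
    {"name": "train-03", "status": "Completed", "metrics": {"auc": 0.87}},
]

# Because the SDK returns ordinary objects, filtering and ranking is one
# expression, rather than parsing command-line output
best = max((j for j in jobs if j["status"] == "Completed"),
           key=lambda j: j["metrics"]["auc"])
print(best["name"])  # train-01
```

Achieving the same result from a CLI requires capturing and parsing text or JSON output, which is workable for automation scripts but harder to compose into larger programs.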

MLOps Model Factory Template

Link to Template

The MLOps Model Factory template deploys a platform for automatically building, training, and deploying ML models at scale. It includes features for creating and managing large numbers of ML models, and automates the model building process.

  • Model data: Each model can define its own data sources and datasets, with no linkage to other models.
  • Model framework: Each model can use its framework of choice. Models can be built using widely used frameworks such as PyTorch, Keras, and scikit-learn, among others.
  • Dependencies: Each model can define its own environment, including its required dependencies.
  • Model pipeline: Each model can have its own customized pipeline. The template code includes data preparation, transformation, model training, scoring, evaluation, and registration.
  • Path to production: Each model can evolve separately and at its own pace over time.
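The per-model isolation described above can be sketched as a specification object that each model owns independently. The `ModelSpec` class and its field names are hypothetical illustrations of the idea, not the template's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class ModelSpec:
    """Hypothetical per-model specification: each model owns its data
    source, framework, environment, and pipeline, with no linkage to
    any other model."""
    name: str
    data_source: str
    framework: str                      # e.g. "pytorch", "keras", "sklearn"
    dependencies: list = field(default_factory=list)
    pipeline_steps: list = field(default_factory=lambda: [
        "prepare", "transform", "train", "score", "evaluate", "register"])

# Two models with entirely separate data, frameworks, and dependencies
churn = ModelSpec("churn", "sql://sales/churn", "sklearn",
                  dependencies=["scikit-learn==1.4"])
forecast = ModelSpec("forecast", "blob://ts/demand", "pytorch",
                     dependencies=["torch==2.2"])

# Each model evolves independently: changing one spec never touches another
churn.pipeline_steps = ["prepare", "train", "evaluate", "register"]
print([m.name for m in (churn, forecast)])
```

The model factory iterates over specifications like these to generate one build-and-train pipeline per model, which is what lets each model follow its own path to production.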

Databricks Templates

These templates provide similar capabilities to the Azure ML templates, but use Databricks as the main compute interface for training the machine learning models.

Learn More

If this is your first time exploring MLOps, much of this information is meant to explain and explore the practice; consider reviewing more of the capabilities and guidance before implementing this solution. Good places to start include:

  • MLOps 101 is our high-level overview of MLOps and why it matters.
  • Working with Azure ML is an overview of the Azure ML service, which is crucial to several implementations.
  • Databricks vs Azure ML breaks down the key features and differences between the Azure Machine Learning service and Databricks.

For customers who are using Databricks: