MLOps 101
Delivering Machine Learning
Machine Learning DevOps (MLOps) is an organizational approach that relies on a combination of people, processes, and technology. This approach delivers Machine Learning solutions in a robust, scalable, reliable, and automated way. This guide provides a balanced view across these three areas.
How Machine Learning DevOps is different from DevOps
For an intro to Azure's tools for Machine Learning and MLOps, check out the Azure ML Intro document.
Exploration precedes development and operations
Data science projects are different from application development or data engineering projects. Data science projects may or may not make it to production. After an initial analysis, it might become clear that the business outcome can't be achieved with the available datasets. For this reason, an exploration phase is usually the first step in a data science project. The goal in this phase is to define and refine the problem and run exploratory data analysis. During exploratory data analysis, statistics and visualizations are used to confirm or falsify the problem hypotheses. There must be a common understanding that the project might not extend beyond this phase. It's important to make this phase as seamless as possible to allow for a quick turnaround.
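As a minimal illustration, an exploratory pass with pandas might look like the following sketch. The dataset path and column names are hypothetical:

```python
import pandas as pd

# Load a candidate dataset (path and column names are hypothetical).
df = pd.read_csv("customer_churn.csv")

# Profile the data: shape, types, missing values, and summary statistics.
print(df.shape)
print(df.dtypes)
print(df.isna().mean().sort_values(ascending=False))  # share of missing values per column
print(df.describe())

# Check whether a candidate feature relates to the target at all; a near-zero
# correlation is early evidence against the problem hypothesis.
print(df["monthly_spend"].corr(df["churned"].astype(int)))
```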
Data scientists work most efficiently with the tools of their choice. Unless there are strict security requirements, it's better not to enforce overly restrictive processes at this stage. For guidance on introducing security, refer to the Data Privacy section.
Real data is needed for data exploration work.
The experimentation and development stage usually begins when there is enough confidence that the data science project is feasible and can provide real business value. This stage is when development practices become increasingly important. It's a good practice to capture metrics for all of the experiments that are done at this stage. It's also important to incorporate source control, which makes it possible to compare models and toggle between different versions of the code if needed. Development activities include the following tasks:
- The refactoring, testing, and automation of exploration code into repeatable experimentation pipelines
- The creation of model serving applications and pipelines
Refactoring code into more modular components and libraries helps increase reusability and testability, and it allows for performance optimization. Finally, what is deployed into staging and production environments is the model serving application or batch inference pipelines.
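For example, a block of notebook code can be refactored into parameterized functions that a pipeline step can call and a unit test can cover. The following sketch is illustrative; the column names, model choice, and metric are assumptions, not prescriptions:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def prepare_features(df: pd.DataFrame) -> pd.DataFrame:
    """Data preparation extracted from a notebook into a reusable, unit-testable step."""
    out = df.dropna(subset=["monthly_spend"]).copy()
    out["spend_per_visit"] = out["monthly_spend"] / out["visits"].clip(lower=1)
    return out

def train_and_evaluate(df: pd.DataFrame, target: str = "churned", seed: int = 42) -> float:
    """One repeatable experiment step: prepare, split, train, and return a metric."""
    features = prepare_features(df)
    X = features.drop(columns=[target])
    y = features[target]
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=seed, stratify=y)
    model = RandomForestClassifier(random_state=seed).fit(X_train, y_train)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
```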
Infrastructure reliability and performance must be monitored, similar to what's done for a regular application with traditional DevOps. The quality of the data, the data profile, and the performance of the model must be continuously monitored against its business objectives to mitigate factors such as data drift.
Machine Learning models require retraining over time to stay relevant in a changing environment. Refer to the Data Drift section for more information.
Data Science Lifecycle requires an adaptive way of working
If you apply a typical DevOps way of working to a data science project, you might not find success, because of the uncertain nature of data quality and the correlation (or lack thereof) between independent and dependent variables. Exploration and experimentation are recurring activities and need to be practiced throughout a Machine Learning project. The teams at Microsoft follow a project lifecycle and working process that was developed to reflect data science-specific activities. The Team Data Science Process and The Data Science Lifecycle Process are examples of reference implementations.
Data quality requirements and data availability constrain the work environment
For a Machine Learning team to effectively develop Machine Learning-infused applications, it is essential to have access to production data or data that is representative of the production data.
Machine Learning requires a greater operational effort
Unlike traditional software, a Machine Learning solution is constantly at risk of degradation because of its dependency on data quality. To maintain a high-quality solution once it's in production, continuous monitoring and re-evaluation of data and model quality are critical. It's expected that a production model requires timely retraining, redeployment, and tuning. These tasks come on top of day-to-day security, infrastructure monitoring, and compliance requirements, and they require special expertise.
Machine Learning teams require specialists and domain experts
While data science projects share roles with regular IT projects, the success of a Machine Learning team depends highly on Machine Learning technology specialists and domain subject matter experts. Where the technology specialist has the right background to run end-to-end Machine Learning experimentation, the domain expert can support the specialist in analyzing and synthesizing the data, or in qualifying the data for use.
The following common technical roles are unique to data science projects:
- Domain Expert
- Data Engineer
- Data Scientist
- AI Engineer
- Model Validator
- Machine Learning Engineer
To learn more about roles and tasks within a typical data science team, also refer to the Team Data Science Process.
Seven principles for Machine Learning DevOps
When you plan to adopt MLOps for your next Machine Learning project, consider applying the following core principles as the foundation of any project.
Version control code, data, and experimentation outputs
Unlike traditional software, data has a direct influence on the quality of Machine Learning models. Along with versioning your experimentation code base, version your datasets to ensure you can reproduce experiments or inference results. Versioning experimentation outputs like models can save effort and the computational cost of recreating them.
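One lightweight way to pin a data version is to record a content hash of the dataset alongside each experiment; dedicated tools (for example, DVC or Azure Machine Learning data assets) serve the same purpose at scale. This sketch assumes file-based data and illustrative paths:

```python
import hashlib
import json
from pathlib import Path

def dataset_fingerprint(path: str) -> str:
    """Hash a dataset file so an experiment can pin the exact data version it used."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # read in 1 MiB chunks
            digest.update(chunk)
    return digest.hexdigest()

# Record the data version next to the experiment outputs (paths are illustrative).
manifest = {"dataset": "training.csv", "sha256": dataset_fingerprint("training.csv")}
Path("run_manifest.json").write_text(json.dumps(manifest, indent=2))
```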
Use multiple environments
To segregate development and testing from production work, replicate your infrastructure in at least two environments. Access control for users might differ in each environment.
Manage infrastructure and configurations as code
When you create and update infrastructure components in your work environments, use infrastructure as code to prevent inconsistencies between environments. Manage Machine Learning experiment job specifications as code. Managing specifications as code makes it possible to easily rerun and reuse a version of your experiment across environments.
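As a minimal sketch of a job specification as code, a declarative spec can be committed, diffed, and rerun in any environment. The fields here are hypothetical, not tied to a specific platform:

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class TrainingJobSpec:
    """A versionable experiment job specification (fields are illustrative)."""
    script: str
    compute: str
    environment: str
    parameters: dict

spec = TrainingJobSpec(
    script="train.py",
    compute="cpu-cluster",
    environment="sklearn-1.4",
    parameters={"learning_rate": 0.1, "n_estimators": 200},
)

# Serialize the spec so it can be committed, reviewed, and rerun across environments.
print(json.dumps(asdict(spec), indent=2))
```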
Track and manage Machine Learning experiments
Track the performance KPIs and other artifacts of your Machine Learning experiments. Keeping a history of job performance allows for a quantitative analysis of experimentation success and enables greater team collaboration and agility.
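For example, experiment tracking with MLflow, which Azure Machine Learning supports as a tracking backend, might look like the following sketch; the experiment name, parameters, and metric values are placeholders:

```python
import mlflow

# Group related runs under one experiment so results can be compared over time.
mlflow.set_experiment("churn-model")

with mlflow.start_run():
    # Log the inputs that define the run...
    mlflow.log_param("n_estimators", 200)
    mlflow.log_param("dataset_sha256", "placeholder-hash")  # pin the data version
    # ...and the KPIs it produced.
    mlflow.log_metric("val_auc", 0.87)
    mlflow.log_metric("val_precision", 0.71)
```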
Test code, validate data integrity, and assess model quality
Test your experimentation code base for the following items (a minimal test sketch follows the list):
- Correctness of data preparation functions
- Correctness of feature extraction functions
- Data integrity
- Model performance
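A minimal pytest sketch for the first three items, reusing the hypothetical prepare_features function from the earlier refactoring sketch, could look like this:

```python
import pandas as pd
import pytest

from churn_pipeline import prepare_features  # hypothetical module from the refactoring sketch

@pytest.fixture
def sample_frame() -> pd.DataFrame:
    return pd.DataFrame(
        {"monthly_spend": [10.0, None, 30.0], "visits": [2, 1, 0], "churned": [0, 1, 0]}
    )

def test_data_preparation_drops_missing_spend(sample_frame):
    # Correctness of the data preparation function.
    assert prepare_features(sample_frame)["monthly_spend"].notna().all()

def test_feature_extraction_handles_zero_visits(sample_frame):
    # Correctness of the feature extraction: guard against division by zero.
    assert prepare_features(sample_frame)["spend_per_visit"].notna().all()

def test_data_integrity_schema(sample_frame):
    # Data integrity: fail fast if an upstream change removes a required column.
    assert {"monthly_spend", "visits", "churned"} <= set(sample_frame.columns)
```

Model performance checks typically run against a held-out dataset and assert that a metric stays above an agreed threshold.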
Machine Learning continuous integration and delivery
Use continuous integration to automate test execution in your team. To ensure that only high-quality models land in production, include the following processes (a quality-gate sketch follows the list):
- Model training as part of a continuous training pipeline
- A/B testing as part of your release
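One common gate is to evaluate a freshly trained candidate in the pipeline and only promote it if it clears a quality bar (or beats the current production model). This sketch uses the hypothetical train_and_evaluate function and an illustrative threshold:

```python
import sys

import pandas as pd

from churn_pipeline import train_and_evaluate  # hypothetical module from the refactoring sketch

MIN_AUC = 0.80  # illustrative quality bar agreed with the business

def main() -> int:
    df = pd.read_csv("training.csv")  # path is illustrative
    auc = train_and_evaluate(df)
    print(f"candidate AUC: {auc:.3f}")
    if auc < MIN_AUC:
        print("Quality gate failed: candidate model is not promoted.")
        return 1  # a non-zero exit code fails the CI job
    # Model registration and promotion would happen here.
    return 0

if __name__ == "__main__":
    sys.exit(main())
```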
Monitor services, models, and data
When you serve Machine Learning models in an operationalized environment, it's critical to monitor these services for their infrastructure uptime, compliance, and model quality. Set up monitoring to identify data and model drift, so you understand whether retraining is required. Alternatively, set up triggers for automatic retraining.
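As a simple illustration, a two-sample Kolmogorov–Smirnov test from SciPy can flag when a feature's live distribution has shifted away from the training baseline; the significance level and synthetic data are illustrative:

```python
import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(baseline: np.ndarray, live: np.ndarray, alpha: float = 0.01) -> bool:
    """Flag drift when the live distribution differs significantly from the baseline."""
    statistic, p_value = ks_2samp(baseline, live)
    return p_value < alpha

# Synthetic example: the live feature has shifted upward relative to training data.
rng = np.random.default_rng(0)
baseline = rng.normal(loc=0.0, scale=1.0, size=5_000)
live = rng.normal(loc=0.4, scale=1.0, size=5_000)

if feature_drifted(baseline, live):
    print("Drift detected: consider triggering the retraining pipeline.")
```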
MLOps at organizational scale: AI factories
A data science team might decide they can manage a handful of Machine Learning use cases internally. The adoption of Machine Learning DevOps (MLOps) helps set up project teams for better quality, reliability, and maintainability of solutions through the following:
- Balanced teams
- Supported processes
- Technology automation
This adoption allows the team to scale and focus on the development of new use cases.
As the number of use cases grows in an organization, the management burden of supporting these use cases grows linearly, or even faster. The challenge becomes how to use organizational scale to do the following:
- Accelerate time-to-market
- Quicken assessment of use case feasibility
- Enable repeatability
- Determine how to best utilize available resources and skill sets across the full range of projects
An AI factory is the development of a repeatable business process and a collection of standardized artifacts that optimize the following aspects:
- Team set-up
- Recommended practices
- MLOps strategy
- Architectural patterns
- Reusable templates tailored to business requirements
This process and the artifacts accelerate the development and deployment of a large set of Machine Learning use cases.
Standardize on repeatable architectural patterns
Repeatability is a key part of developing a factory process. By developing a few repeatable architectural patterns that cover most of the Machine Learning use cases for their organization, data science teams can do the following:
- Accelerate project development
- Improve consistency across projects
Once these patterns are in place, most projects can use these patterns and reap the following benefits:
- Accelerated design phase
- Accelerated approvals from IT and security teams when they reuse tools across projects
- Accelerated development due to reusable infrastructure as code templates and project templates (which are covered in more detail in the next section).
The architectural patterns can include but aren't limited to the following topics:
- Preferred services for each stage of the project
- Data connectivity and governance
- A Machine Learning DevOps (MLOps) strategy tailored to the requirements of the industry, business, or data classification
- An experiment management process, such as Champion or Challenger models
Facilitate cross-team collaboration and sharing
Shared code repositories and utilities can accelerate the development of Machine Learning solutions. These repositories can be developed in a modular way during project development so they are generic enough to be used by other projects. They can be made available in a central repository that all data science teams can access.
Share and reuse intellectual property
At the beginning of a project, the following intellectual property should be reviewed to maximize code reuse:
- Internal code, such as packages and modules, which have been designed for reuse within the organization.
- Datasets, which have been created in other Machine Learning projects or that are available in the Azure ecosystem.
- Existing data science projects with similar architecture and business problems.
- GitHub or open-source repos that can accelerate the project.
Project retrospectives should include an action item to review whether elements of the project can be shared and generalized for broader reuse, so that the list of assets above grows organically over time.
To help with sharing and discovery, many companies have introduced shared repositories to organize code snippets and Machine Learning artifacts. The following artifacts in Azure Machine Learning can be defined as code, which allows you to share them efficiently across projects and workspaces (a sketch follows the list):
- Datasets
- Models
- Environments
- Pipelines
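For instance, with the Azure Machine Learning Python SDK (v2), an environment can be defined as code and registered for reuse; the names, image, and conda file path below are placeholders:

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import Environment
from azure.identity import DefaultAzureCredential

# Connect to the workspace (reads subscription and workspace details from a local config file).
ml_client = MLClient.from_config(credential=DefaultAzureCredential())

# Define the environment as code so it's versioned and shareable across projects.
env = Environment(
    name="team-sklearn-env",  # placeholder name
    description="Shared training environment for churn projects.",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",  # placeholder base image
    conda_file="environments/conda.yml",  # placeholder path
)

ml_client.environments.create_or_update(env)
```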
Project templates
Many companies have standardized on a project template to kick-start new projects and to do the following things:
- Accelerate the migration of existing solutions
- Maximize code reuse when starting a new project
Central data management
The process to get access to data for exploration or production usage can be time-consuming. Many companies centralize their data management to bring data producers and data consumers together and to help with easier data access for Machine Learning experimentation.
Shared utilities
Enterprise-wide centralized dashboards can be implemented to consolidate the following logging and monitoring information (a minimal logging sketch follows the list):
- Error logging
- Service availability
- Service telemetry
- Model performance monitoring
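A small amount of convention helps here. For example, emitting structured JSON log lines makes it straightforward for a central dashboard to ingest error logs and model telemetry; the service and field names below are illustrative:

```python
import json
import logging
import time

logger = logging.getLogger("model_service")
logging.basicConfig(level=logging.INFO, format="%(message)s")

def log_event(event: str, **fields) -> None:
    """Emit one structured JSON log line that a central dashboard can parse."""
    record = {"ts": time.time(), "service": "churn-scorer", "event": event, **fields}
    logger.info(json.dumps(record))

# Example telemetry: a scoring request and a rolling model-quality signal.
log_event("prediction_served", latency_ms=12.4, model_version="3")
log_event("model_quality", metric="rolling_auc", value=0.83)
```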
Create a specialist Machine Learning engineering team
Many companies have implemented the role of the Machine Learning engineer. The Machine Learning engineer specializes in the following aspects:
- The creation and operation of robust Machine Learning pipelines
- Drift monitoring and retraining workflows
- Monitoring dashboards
They take overall responsibility for industrializing the Machine Learning solution from development to production. They work closely with data engineering, architects, and security and operations teams to ensure that all the necessary controls are in place.
While data science requires deep domain expertise, Machine Learning engineering as a discipline is more technical in focus. This difference makes the Machine Learning engineer more flexible to work across various projects and business departments. Large data science teams can benefit from a specialized Machine Learning engineering team. This specialized team can drive repeatability and reuse of automation workflows across various use cases and business departments.
Enablement and documentation
It's important to provide clear guidance on the AI factory process to new and existing teams and users. This guidance will ensure consistency and reduce the amount of effort required of the Machine Learning engineering team. Consider designing content specifically for the various roles in your organization.
Everyone has a unique learning style, so a mixture of the following types of documents can help accelerate the adoption of the AI factory framework.
- Central hub with links to all artifacts
- Training and enablement plan designed for each role
- High-level summary presentation of the approach along with a companion video
- Detailed documentation
- How-to videos
- Readiness assessments
Ethics
Ethics play an instrumental role in the design of an AI solution. If ethical principles aren't implemented, trained models can exhibit the same bias present in the data they were trained on. This issue can result in the project being discontinued and more importantly, it can risk the organization's reputation.
To ensure that the key ethical principles the company stands for are implemented across projects, a list of these principles must be provided, along with ways of validating them from a technical perspective during the testing phase (a fairness-check sketch follows the list). Use the responsible Machine Learning features in Azure Machine Learning to learn the following things:
- What responsible Machine Learning is
- Ways you can put it into practice
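As one concrete example of a technical validation, an open-source library such as Fairlearn can quantify whether a model's accuracy differs across a sensitive group during the testing phase; the data is synthetic and the tolerance is illustrative:

```python
import numpy as np
from fairlearn.metrics import MetricFrame
from sklearn.metrics import accuracy_score

# Synthetic evaluation data: labels, predictions, and a sensitive attribute.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
group = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

# Compare accuracy per group; a large gap is a signal to investigate bias.
frame = MetricFrame(
    metrics=accuracy_score, y_true=y_true, y_pred=y_pred, sensitive_features=group
)
print(frame.by_group)

if frame.difference() > 0.2:  # illustrative tolerance
    print("Accuracy gap across groups exceeds the tolerance: review the model and data.")
```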
Refer to the Responsible AI section.
For more detailed information, see the Machine Learning DevOps guide.