Using workflows and repos for MLOps in Databricks

The following diagram shows an end-to-end MLOps process that is common in enterprise environments:

MLOps Architecture

This process supports three environments that can be implemented in a single Databricks workspace or separated across three different workspaces.

Common tasks include moving code from one branch to another (pull requests and merges), invoking the model approval process in the production environment, and running unit and integration tests. Most of these tasks can be automated with Azure DevOps pipelines or GitHub Actions workflows. Models can be stored and tracked in the MLflow Model Registry, which is integrated with Databricks.

However, the process requires some additional features to be feasible:

  • The ability to run code that is split across several files and custom libraries.
  • Support for job clusters.
  • An option to implement the ML pipeline as a set of independent steps.

The MLOps template based on MLflow Projects provides all the features of the MLflow framework, along with custom rules and naming conventions.
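A multi-step pipeline can be described as entry points in an MLproject file. The sketch below is illustrative only: the project name, step names, script paths, and parameters are assumptions, not taken from this article or the template itself.

```yaml
name: mlops-template  # hypothetical project name

entry_points:
  # each entry point is an independent step that can be invoked separately
  prepare_data:
    parameters:
      input_path: {type: string}
    command: "python steps/prepare_data.py --input-path {input_path}"
  train:
    parameters:
      epochs: {type: float, default: 10}
    command: "python steps/train.py --epochs {epochs}"
```

Each entry point can then be run on its own with `mlflow run`, which is what makes it possible to chain the steps as independent tasks.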

Workflows

Workflow management in Databricks makes it possible to create a job that contains one or more tasks. A job can be invoked through the API, on a schedule, or from the UI. Moreover, each task in a job can have its own job cluster. You can use the API to create jobs and apply a custom naming convention to differentiate jobs in a multi-user environment. Try a basic workflow using this tutorial.
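As a minimal sketch of creating such a job through the Jobs API 2.1, assuming an Azure Databricks workspace: the job name, notebook path, cluster sizing, and environment variables below are placeholders, not values from this article.

```python
import json
import os
import urllib.request


def build_job_payload(job_name, notebook_path):
    """Build a Jobs API 2.1 payload: one task with its own job cluster."""
    return {
        # a naming convention (e.g. a user prefix) helps differentiate
        # jobs in a multi-user environment
        "name": job_name,
        "tasks": [
            {
                "task_key": "train",
                "notebook_task": {"notebook_path": notebook_path},
                "new_cluster": {  # dedicated job cluster for this task
                    "spark_version": "13.3.x-cpu-ml-scala2.12",  # assumed runtime
                    "node_type_id": "Standard_DS3_v2",  # assumed Azure node type
                    "num_workers": 1,
                },
            }
        ],
    }


def create_job(host, token, payload):
    """POST the payload to /api/2.1/jobs/create; returns the new job_id."""
    req = urllib.request.Request(
        f"{host}/api/2.1/jobs/create",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["job_id"]


if __name__ == "__main__":
    payload = build_job_payload("user1-train-model", "/Repos/user1/project/train")
    print(json.dumps(payload, indent=2))
    # To actually create the job (requires workspace credentials):
    # create_job(os.environ["DATABRICKS_HOST"], os.environ["DATABRICKS_TOKEN"], payload)
```

Keeping the payload builder separate from the HTTP call makes the naming convention easy to test without touching a workspace.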

Databricks repos

Databricks Repos lets you keep a copy of a repository in Databricks and run workflows against it. Databricks Repos supports branches and provides a rich API, so it's possible to implement your own branching strategy in Databricks. You can also automate integration testing, or model training on toy or full datasets. This document contains more details about how Databricks Repos can be used in the development process.
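For example, a CI pipeline can switch a Databricks repo to the branch under test before running integration tests, using the Repos API (`PATCH /api/2.0/repos/{repo_id}`). A minimal sketch, where the workspace URL, token, repo ID, and branch name are all placeholders:

```python
import json
import urllib.request


def checkout_branch_request(host, token, repo_id, branch):
    """Build a Repos API request that switches a Databricks repo to a branch."""
    return urllib.request.Request(
        f"{host}/api/2.0/repos/{repo_id}",
        data=json.dumps({"branch": branch}).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="PATCH",
    )


# Usage sketch with placeholder values; sending it would be:
# urllib.request.urlopen(req)
req = checkout_branch_request(
    "https://adb-1234567890.azuredatabricks.net",  # placeholder workspace URL
    "dapi-example-token",                          # placeholder token
    42,                                            # placeholder repo ID
    "feature/integration-test",
)
print(req.get_method(), req.get_full_url())
```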

Next steps

The MLOps for Databricks Solution Accelerator implements this architecture using the latest Databricks features discussed in this content.