Experimentation tools and setup in Azure ML

The basics

Based on customer engagements, the recommended core suite of tools for any MLOps engagement includes:

  1. Visual Studio Code.
  2. Azure Machine Learning Workspace (minimum: 1 workspace for Dev, 1 for Production).
  3. Azure Machine Learning Compute Instances for Experimentation (1 per dev is common).
  4. A Centralized Git Repo (for example, Azure DevOps or GitHub).
  5. An agreed-upon folder structure.
    1. See the sample folder structure below.

Why use VS Code?

Visual Studio Code, often called VS Code, is an open-source development environment with a large library of extensions that customize it to each developer's needs. VS Code can run Jupyter Notebooks, which data scientists are already familiar with. It also provides tools such as IntelliSense and IntelliCode to help them migrate their notebooks into reusable scripts.

By combining VS Code with Azure ML's Compute Instances, data scientists have access to a powerful suite of tools. The Compute Instance can be configured ahead of time with the packages and kernels that a data scientist needs, so they can focus on their work instead of environment setup.

Sample folder structure

A clear folder structure helps manage a complex project such as an MLOps engagement. A common successful pattern is:

├── .devcontainer       <- Container definition for local development
│   ├── devcontainer.json
│   └── Dockerfile
├── .pipelines          <- CI/CD pipeline definitions
│   ├── ci-build.yaml
│   ├── pr-build.yaml
│   └── deploy.yaml
├── data                <- Datasets, if of reasonable size
│   ├── external        <- Data from third party sources
│   ├── raw             <- Original, immutable data dump
│   ├── interim         <- Intermediate data that has been transformed
│   └── processed       <- Final, canonical data sets for modeling
├── docs                <- Collection of markdown files or Sphinx project
├── notebooks           <- Jupyter Notebooks for experimentation (linting optional)
│   └── usrID_userStoryNumber_description.ipynb
├── mlops               <- ML pipeline definition code (such as CLI v2 .YAML files)
│   └── training
│       ├── pipeline.yml
│       └── train-env.yml
├── src                 <- Model creation code, for example, data prep, training, and scoring
│   ├── train
│   │   └── train.py
│   └── score
│       └── score.py

The goal with a folder structure such as this is to provide logical working spaces for the various pieces of an MLOps engagement. During experimentation, much of the work is done in the notebooks folder, where data scientists may check in Jupyter Notebooks.
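As a minimal sketch, the sample structure can be scaffolded with a short Python script. The function name scaffold, the FOLDERS list, and the use of .gitkeep placeholder files are illustrative assumptions, not part of any Azure ML tooling; adjust the list to whatever layout the team agrees on.

```python
from pathlib import Path

# Directories taken from the sample structure; the list itself is an
# illustrative assumption and should match the team's agreed layout.
FOLDERS = [
    ".devcontainer",
    ".pipelines",
    "data/external",
    "data/raw",
    "data/interim",
    "data/processed",
    "docs",
    "notebooks",
    "mlops/training",
    "src/train",
    "src/score",
]


def scaffold(root):
    """Create the sample folders under root and return the created paths."""
    created = []
    for folder in FOLDERS:
        path = Path(root) / folder
        # parents=True creates intermediate folders such as data/ and src/.
        path.mkdir(parents=True, exist_ok=True)
        # Git does not track empty directories, so add a placeholder file.
        (path / ".gitkeep").touch()
        created.append(path)
    return created
```

Calling scaffold(".") from the root of a freshly cloned repo creates the folders in place; because exist_ok=True is set, re-running it leaves existing folders untouched.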

For another example of an MLOps folder structure, review the MLOps Template for Azure ML CLI V2.

Notebooks are traditionally difficult to version control because each re-run of a notebook changes the underlying notebook file (for example, cell outputs and execution counts), even when the code itself is unchanged. Tools such as nb-clean and nbQA can be integrated into PR gates or, better yet, pre-commit hooks to ease integration of notebooks with version control. Although this integration enables basic linting and code formatting within notebooks, these tools may be new to data scientists' workflows. Weigh their benefits against the disruption they cause on a project-by-project basis.
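To illustrate why cleaning helps, the following sketch removes the parts of a notebook that typically change between runs. It is not how nb-clean or nbQA are implemented; it is a simplified stand-in that assumes the standard .ipynb JSON layout (a top-level "cells" list whose code cells carry "outputs" and "execution_count" fields). The function name strip_notebook is hypothetical.

```python
import json


def strip_notebook(nb):
    """Return a copy of a notebook dict without outputs or execution counts."""
    # A JSON round trip gives a deep copy, so the input is left unmodified.
    cleaned = json.loads(json.dumps(nb))
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            # These two fields change on every run even if the code does not.
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned
```

In practice, a notebook file would be loaded with json.load, passed through the function, and written back with json.dump before committing. A dedicated tool wired into a pre-commit hook is usually the more robust choice, since it also handles notebook metadata.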

Setting up for experimentation

Once the repo has been created and the core requisites are met, the team is ready to connect to a Compute Instance and begin experimentation.

From VS Code, developers and data scientists can connect to their Compute Instance in one of two ways: by connecting to it as a remote compute instance, or by configuring it as a remote Jupyter server.

Once connected to the Compute Instance, users can clone the Git repo directly to the Workspace file system.

With that connection, users have a managed, scalable compute target for experimentation that is cloud-based and connected directly to the Azure Machine Learning Workspace.