Use multistep pipeline components in pipeline jobs

APPLIES TO: Azure CLI ml extension v2 (current) | Python SDK azure-ai-ml v2 (current)

It's common to use pipeline components to develop complex machine learning pipelines. You can group multiple steps into a pipeline component that you use as a single step to do tasks like data preprocessing or model training.

This article shows you how to nest multiple steps into pipeline components that you use to build complex Azure Machine Learning pipeline jobs. You can develop and test these multistep components as standalone units, which makes it easier to share your work and collaborate with team members.

By using multistep pipeline components, you can focus on developing subtasks and easily integrate them with the entire pipeline job. A pipeline component has a well-defined input and output interface, so multistep pipeline component users don't need to know the implementation details of the component.

Both pipeline components and pipeline jobs contain groups of steps or components, but defining a pipeline component differs from defining a pipeline job in the following ways:

  • Pipeline components define only the interfaces of inputs and outputs. In a pipeline component, you explicitly set the input and output types, but you don't directly assign values to them.
  • Pipeline components don't have runtime settings, so you can't hard-code a compute target or data node in a pipeline component. Instead, you must promote these settings as pipeline-level inputs and assign values at runtime (see the sketch after this list).
  • Pipeline-level settings such as default_datastore and default_compute are also runtime settings that aren't part of pipeline component definitions.
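
As a minimal sketch, the following hypothetical fragments contrast the interface declaration with the runtime assignment. The input name node_compute and the cluster name cpu-cluster are illustrative, not required names:

# In the pipeline component definition, declare only the interface:
inputs:
  node_compute:
    type: string

# In the pipeline job that uses the component, assign the runtime value:
inputs:
  node_compute: cpu-cluster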

Prerequisites

  • Have an Azure Machine Learning workspace. For more information, see Create workspace resources.
  • Understand the concepts of Azure Machine Learning pipelines and components, and know how to use components in Azure Machine Learning pipelines.

Build pipeline jobs with pipeline components

You can define multiple steps as a pipeline component, and then use the multistep component like any other component to build a pipeline job.

Define pipeline components

You can use multiple components to build a pipeline component, similar to how you build pipeline jobs with components.

The following example comes from the pipeline_with_train_eval_pipeline_component example pipeline in the Azure Machine Learning examples GitHub repository.

The example defines a pipeline component that has three nodes: train_job, score_job, and evaluate_job. Each node uses one of the locally defined components train, score, and eval. The following code defines the pipeline component:

$schema: https://azuremlschemas.azureedge.net/latest/pipelineComponent.schema.json
type: pipeline

name: train_pipeline_component
display_name: train_pipeline_component
description: Dummy train-score-eval pipeline component with local components

inputs:
  training_data: 
    type: uri_folder  # default/path is not supported for data type
  test_data: 
    type: uri_folder  # default/path is not supported for data type
  training_max_epochs:
    type: integer
  training_learning_rate: 
    type: number
  learning_rate_schedule:
    type: string
    default: 'time-based'
  train_node_compute: # example to show how to promote compute as input
    type: string

outputs: 
  trained_model:
    type: uri_folder
  evaluation_report:
    type: uri_folder

jobs:
  train_job:
    type: command
    component: ./train/train.yml
    inputs:
      training_data: ${{parent.inputs.training_data}}
      max_epochs: ${{parent.inputs.training_max_epochs}}
      learning_rate: ${{parent.inputs.training_learning_rate}}
      learning_rate_schedule: ${{parent.inputs.learning_rate_schedule}}
      
    outputs:
      model_output: ${{parent.outputs.trained_model}}
    compute: ${{parent.inputs.train_node_compute}}
  
  score_job:
    type: command
    component: ./score/score.yml
    inputs:
      model_input: ${{parent.jobs.train_job.outputs.model_output}}
      test_data: ${{parent.inputs.test_data}}
    outputs:
      score_output: 
        mode: upload

  evaluate_job:
    type: command
    component: ./eval/eval.yml
    inputs:
      scoring_result: ${{parent.jobs.score_job.outputs.score_output}}
    outputs:
      eval_output: ${{parent.outputs.evaluation_report}}

Use pipeline components in pipelines

You reference pipeline components as child jobs in a pipeline job just like you reference other types of components. You can provide runtime settings like default_datastore and default_compute at the pipeline job level.

You need to promote any parameters that you want to change at runtime as pipeline job inputs. Otherwise, they're hard-coded in the pipeline component. Promoting the compute definition to a pipeline-level input supports heterogeneous pipelines that use different compute targets in different steps.

To submit the pipeline job, change cpu-cluster in the default_compute section to the name of a compute cluster in your workspace, and then run the az ml job create -f pipeline.yml command shown after the following definition.

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json

display_name: pipeline_with_pipeline_component
experiment_name: pipeline_with_pipeline_component
description: Select best model trained with different learning rate
type: pipeline

inputs:
  pipeline_job_training_data: 
    type: uri_folder
    path: ./data
  pipeline_job_test_data: 
    type: uri_folder
    path: ./data
  pipeline_job_training_learning_rate1: 0.1
  pipeline_job_training_learning_rate2: 0.01
  compute_train_node: cpu-cluster
  compute_compare_node: cpu-cluster

outputs: 
  pipeline_job_best_model:
    mode: upload
  pipeline_job_best_result:
    mode: upload

settings:
  default_datastore: azureml:workspaceblobstore
  default_compute: azureml:cpu-cluster
  continue_on_step_failure: false

jobs:
  train_and_evaluate_model1:
    type: pipeline
    component: ./components/train_pipeline_component.yml
    inputs:
      training_data: ${{parent.inputs.pipeline_job_training_data}}
      test_data: ${{parent.inputs.pipeline_job_test_data}}
      training_max_epochs: 20
      training_learning_rate: ${{parent.inputs.pipeline_job_training_learning_rate1}}
      train_node_compute: ${{parent.inputs.compute_train_node}}

  train_and_evaluate_model2:
    type: pipeline
    component: ./components/train_pipeline_component.yml
    inputs:
      training_data: ${{parent.inputs.pipeline_job_training_data}}
      test_data: ${{parent.inputs.pipeline_job_test_data}}
      training_max_epochs: 20
      training_learning_rate: ${{parent.inputs.pipeline_job_training_learning_rate2}}
      train_node_compute: ${{parent.inputs.compute_train_node}}

  compare:
    type: command
    component: ./components/compare2/compare2.yml
    compute: ${{parent.inputs.compute_compare_node}} # example to show how to promote compute as a pipeline-level input
    inputs:
      model1: ${{parent.jobs.train_and_evaluate_model1.outputs.trained_model}}
      eval_result1: ${{parent.jobs.train_and_evaluate_model1.outputs.evaluation_report}}
      model2: ${{parent.jobs.train_and_evaluate_model2.outputs.trained_model}}
      eval_result2: ${{parent.jobs.train_and_evaluate_model2.outputs.evaluation_report}}
    outputs: 
      best_model: ${{parent.outputs.pipeline_job_best_model}}
      best_result: ${{parent.outputs.pipeline_job_best_result}}
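
After you update the compute name, submit the pipeline job with the Azure CLI. This assumes the preceding definition is saved as pipeline.yml:

az ml job create -f pipeline.yml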

Note

To share or reuse components across jobs in the workspace, you need to register the components. You can use az ml component create to register pipeline components.
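
For example, the following command registers the pipeline component defined earlier in this article. The relative path is an assumption based on the pipeline job definition above; adjust it to match where you saved the file:

az ml component create --file ./components/train_pipeline_component.yml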

You can find more Azure CLI examples of pipeline components at pipelines-with-components in the Azure Machine Learning examples repository.