Use multistep pipeline components in pipeline jobs
APPLIES TO: Azure CLI ml extension v2 (current), Python SDK azure-ai-ml v2 (current)
It's common to use pipeline components to develop complex machine learning pipelines. You can group multiple steps into a pipeline component that you use as a single step to do tasks like data preprocessing or model training.
This article shows you how to nest multiple steps in components that you use to build complex Azure Machine Learning pipeline jobs. You can develop and test these multistep components standalone, which helps you share your work and collaborate better with team members.
By using multistep pipeline components, you can focus on developing subtasks and easily integrate them with the entire pipeline job. A pipeline component has a well-defined input and output interface, so multistep pipeline component users don't need to know the implementation details of the component.
Both pipeline components and pipeline jobs contain groups of steps or components, but defining a pipeline component differs from defining a pipeline job in the following ways:
- Pipeline components define only the interfaces of inputs and outputs. In a pipeline component, you explicitly set the input and output types, but you don't directly assign values to them.
- Pipeline components don't have runtime settings, so you can't hardcode a compute target or data node in a pipeline component. Instead, you must promote these nodes as pipeline-level inputs and assign their values at runtime.
- Pipeline-level settings such as default_datastore and default_compute are also runtime settings that aren't part of pipeline component definitions.
Prerequisites
- Have an Azure Machine Learning workspace. For more information, see Create workspace resources.
- Understand the concepts of Azure Machine Learning pipelines and components, and know how to use components in Azure Machine Learning pipelines.
- Install the Azure CLI and the ml extension. For more information, see Install, set up, and use the CLI (v2). The ml extension automatically installs the first time you run an az ml command.
- Understand how to create and run Azure Machine Learning pipelines and components with the CLI v2.
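Although the ml extension installs itself on first use, you can also install it up front. The following is a minimal sketch, assuming the Azure CLI is already installed and you're signed in to your subscription:

# Install the ml extension explicitly instead of waiting for the
# automatic install on the first az ml command.
az extension add --name ml

# Confirm the extension is available by listing its commands.
az ml -h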
Build pipeline jobs with pipeline components
You can define multiple steps as a pipeline component, and then use the multistep component like any other component to build a pipeline job.
Define pipeline components
You can use multiple components to build a pipeline component, similar to how you build pipeline jobs with components.
The following example comes from the pipeline_with_train_eval_pipeline_component example pipeline in the Azure Machine Learning examples GitHub repository.
The example component defines a three-node pipeline. The three nodes use the locally defined components train, score, and eval, respectively. The following code defines the pipeline component:
$schema: https://azuremlschemas.azureedge.net/latest/pipelineComponent.schema.json
type: pipeline
name: train_pipeline_component
display_name: train_pipeline_component
description: Dummy train-score-eval pipeline component with local components
inputs:
  training_data:
    type: uri_folder # default/path is not supported for data type
  test_data:
    type: uri_folder # default/path is not supported for data type
  training_max_epochs:
    type: integer
  training_learning_rate:
    type: number
  learning_rate_schedule:
    type: string
    default: 'time-based'
  train_node_compute: # example to show how to promote compute as input
    type: string
outputs:
  trained_model:
    type: uri_folder
  evaluation_report:
    type: uri_folder
jobs:
  train_job:
    type: command
    component: ./train/train.yml
    inputs:
      training_data: ${{parent.inputs.training_data}}
      max_epochs: ${{parent.inputs.training_max_epochs}}
      learning_rate: ${{parent.inputs.training_learning_rate}}
      learning_rate_schedule: ${{parent.inputs.learning_rate_schedule}}
    outputs:
      model_output: ${{parent.outputs.trained_model}}
    compute: ${{parent.inputs.train_node_compute}}
  score_job:
    type: command
    component: ./score/score.yml
    inputs:
      model_input: ${{parent.jobs.train_job.outputs.model_output}}
      test_data: ${{parent.inputs.test_data}}
    outputs:
      score_output:
        mode: upload
  evaluate_job:
    type: command
    component: ./eval/eval.yml
    inputs:
      scoring_result: ${{parent.jobs.score_job.outputs.score_output}}
    outputs:
      eval_output: ${{parent.outputs.evaluation_report}}
Use components in pipelines
You reference pipeline components as child jobs in a pipeline job just like you reference other types of components. You can provide runtime settings like default_datastore and default_compute at the pipeline job level.
You need to promote any parameters that you want to change at runtime as pipeline job inputs; otherwise, they're hard-coded in the pipeline component. Promoting the compute definition to a pipeline-level input supports heterogeneous pipelines that use different compute targets in different steps.
To submit the pipeline job, edit cpu-cluster in the default_compute section to match a compute cluster in your workspace before you run the az ml job create -f pipeline.yml command. A sample submission command appears after the following pipeline job definition:
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
display_name: pipeline_with_pipeline_component
experiment_name: pipeline_with_pipeline_component
description: Select best model trained with different learning rate
type: pipeline
inputs:
  pipeline_job_training_data:
    type: uri_folder
    path: ./data
  pipeline_job_test_data:
    type: uri_folder
    path: ./data
  pipeline_job_training_learning_rate1: 0.1
  pipeline_job_training_learning_rate2: 0.01
  compute_train_node: cpu-cluster
  compute_compare_node: cpu-cluster
outputs:
  pipeline_job_best_model:
    mode: upload
  pipeline_job_best_result:
    mode: upload
settings:
  default_datastore: azureml:workspaceblobstore
  default_compute: azureml:cpu-cluster
  continue_on_step_failure: false
jobs:
  train_and_evaluate_model1:
    type: pipeline
    component: ./components/train_pipeline_component.yml
    inputs:
      training_data: ${{parent.inputs.pipeline_job_training_data}}
      test_data: ${{parent.inputs.pipeline_job_test_data}}
      training_max_epochs: 20
      training_learning_rate: ${{parent.inputs.pipeline_job_training_learning_rate1}}
      train_node_compute: ${{parent.inputs.compute_train_node}}
  train_and_evaluate_model2:
    type: pipeline
    component: ./components/train_pipeline_component.yml
    inputs:
      training_data: ${{parent.inputs.pipeline_job_training_data}}
      test_data: ${{parent.inputs.pipeline_job_test_data}}
      training_max_epochs: 20
      training_learning_rate: ${{parent.inputs.pipeline_job_training_learning_rate2}}
      train_node_compute: ${{parent.inputs.compute_train_node}}
  compare:
    type: command
    component: ./components/compare2/compare2.yml
    compute: ${{parent.inputs.compute_compare_node}} # example to show how to promote compute as pipeline level inputs
    inputs:
      model1: ${{parent.jobs.train_and_evaluate_model1.outputs.trained_model}}
      eval_result1: ${{parent.jobs.train_and_evaluate_model1.outputs.evaluation_report}}
      model2: ${{parent.jobs.train_and_evaluate_model2.outputs.trained_model}}
      eval_result2: ${{parent.jobs.train_and_evaluate_model2.outputs.evaluation_report}}
    outputs:
      best_model: ${{parent.outputs.pipeline_job_best_model}}
      best_result: ${{parent.outputs.pipeline_job_best_result}}
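The following sketch shows one way to submit the job. The resource group, workspace, and cluster names are placeholders for your own values, and the --set form is an optional way to override the default compute at submission time instead of editing the YAML file:

# Submit the pipeline job. Replace the placeholder resource group and
# workspace names with your own values.
az ml job create -f pipeline.yml --resource-group <my-resource-group> --workspace-name <my-workspace>

# Optionally override the default compute at submission time instead of
# editing pipeline.yml.
az ml job create -f pipeline.yml --set settings.default_compute=azureml:<my-cluster>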
Note
To share or reuse components across jobs in the workspace, you need to register the components. You can use az ml component create to register pipeline components.
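For example, the following sketch registers the pipeline component defined earlier. The file path assumes the directory layout of the example repository:

# Register the pipeline component so it can be shared and reused across
# jobs in the workspace.
az ml component create --file components/train_pipeline_component.yml

# Optionally list registered components to confirm the registration.
az ml component list --output table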
You can find other Azure CLI pipeline component-related examples and information at pipelines-with-components in the Azure Machine Learning examples repository.