Use multistep pipeline components in pipeline jobs

2024-09-17

APPLIES TO: Azure CLI ml extension v2 (current) Python SDK azure-ai-ml v2 (current)

It's common to use pipeline components to develop complex machine learning pipelines. You can group multiple steps into a pipeline component that you use as a single step to do tasks like data preprocessing or model training.

This article shows you how to nest multiple steps in components that you use to build complex Azure Machine Learning pipeline jobs. You can develop and test these multistep components standalone, which helps you share your work and collaborate better with team members.

By using multistep pipeline components, you can focus on developing subtasks and easily integrate them with the entire pipeline job. A pipeline component has a well-defined input and output interface, so multistep pipeline component users don't need to know the implementation details of the component.

Both pipeline components and pipeline jobs contain groups of steps or components, but defining a pipeline component differs from defining a pipeline job in the following ways:

Pipeline components define only the interfaces of inputs and outputs. In a pipeline component, you explicitly set the input and output types, but you don't directly assign values to them.
Pipeline components don't have runtime settings, so you can't hardcode a compute or data node in a pipeline component. Instead you must promote these nodes as pipeline level inputs and assign values during runtime.
Pipeline level settings such as default_datastore and default_compute are also runtime settings that aren't part of pipeline component definitions.

Prerequisites

Have an Azure Machine Learning workspace. For more information, see Create workspace resources.
Understand the concepts of Azure Machine Learning pipelines and components, and know how to use components in Azure Machine Learning pipelines.

Install the Azure CLI and the ml extension. For more information, see Install, set up, and use the CLI (v2). The ml extension automatically installs the first time you run an az ml command.
Understand how to create and run Azure Machine Learning pipelines and components with the CLI v2.

Build pipeline jobs with pipeline components

You can define multiple steps as a pipeline component, and then use the multistep component like any other component to build a pipeline job.

Define pipeline components

You can use multiple components to build a pipeline component, similar to how you build pipeline jobs with components.

The following example comes from the pipeline_with_train_eval_pipeline_component example pipeline in the Azure Machine Learning examples GitHub repository.

The example component defines a three-node pipeline job. The two nodes in the example pipeline job each use the locally defined components train, score, and eval. The following code defines the pipeline component:

$schema: https://azuremlschemas.azureedge.net/latest/pipelineComponent.schema.json
type: pipeline

name: train_pipeline_component
display_name: train_pipeline_component
description: Dummy train-score-eval pipeline component with local components

inputs:
  training_data: 
    type: uri_folder  # default/path is not supported for data type
  test_data: 
    type: uri_folder  # default/path is not supported for data type
  training_max_epochs:
    type: integer
  training_learning_rate: 
    type: number
  learning_rate_schedule:
    type: string
    default: 'time-based'
  train_node_compute: # example to show how to promote compute as input
    type: string

outputs: 
  trained_model:
    type: uri_folder
  evaluation_report:
    type: uri_folder

jobs:
  train_job:
    type: command
    component: ./train/train.yml
    inputs:
      training_data: ${{parent.inputs.training_data}}
      max_epochs: ${{parent.inputs.training_max_epochs}}
      learning_rate: ${{parent.inputs.training_learning_rate}}
      learning_rate_schedule: ${{parent.inputs.learning_rate_schedule}}
      
    outputs:
      model_output: ${{parent.outputs.trained_model}}
    compute: ${{parent.inputs.train_node_compute}}
  
  score_job:
    type: command
    component: ./score/score.yml
    inputs:
      model_input: ${{parent.jobs.train_job.outputs.model_output}}
      test_data: ${{parent.inputs.test_data}}
    outputs:
      score_output: 
        mode: upload

  evaluate_job:
    type: command
    component: ./eval/eval.yml
    inputs:
      scoring_result: ${{parent.jobs.score_job.outputs.score_output}}
    outputs:
      eval_output: ${{parent.outputs.evaluation_report}}

You can define a pipeline component using a Python function, which is similar to defining a pipeline job using a function. You can also promote the compute of some steps to use as inputs for the pipeline component.

The following Python SDK examples are from the Build pipeline with subpipeline (pipeline component) Azure Machine Learning notebook. Run this notebook to build the example pipeline.

@pipeline()
def train_pipeline_component(
    training_input: Input,
    test_input: Input,
    training_learning_rate: float,
    train_compute: str,
    training_max_epochs: int = 20,
    learning_rate_schedule: str = "time-based",
):
    """E2E dummy train-score-eval pipeline with components defined via yaml."""
    # Call component obj as function: apply given inputs & parameters to create a node in pipeline
    train_with_sample_data = train_model(
        training_data=training_input,
        max_epochs=training_max_epochs,
        learning_rate=training_learning_rate,
        learning_rate_schedule=learning_rate_schedule,
    )
    train_with_sample_data.compute = train_compute

    score_with_sample_data = score_data(
        model_input=train_with_sample_data.outputs.model_output, test_data=test_input
    )
    score_with_sample_data.outputs.score_output.mode = "upload"

    eval_with_sample_data = eval_model(
        scoring_result=score_with_sample_data.outputs.score_output
    )

    # Return: pipeline outputs
    return {
        "trained_model": train_with_sample_data.outputs.model_output,
        "evaluation_report": eval_with_sample_data.outputs.eval_output,
    }

Use components in pipelines

You reference pipeline components as child jobs in a pipeline job just like you reference other types of components. You can provide runtime settings like default_datastore and default_compute at the pipeline job level.

You need to promote any parameters you want to change during runtime as pipeline job inputs. Otherwise, they're hard-coded in the pipeline component. Promoting compute definition to a pipeline level input supports heterogeneous pipelines that can use different compute targets in different steps.

To submit the pipeline job, edit the cpu-cluster in the default_compute section before you run the az ml job create -f pipeline.yml command.

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json

display_name: pipeline_with_pipeline_component
experiment_name: pipeline_with_pipeline_component
description: Select best model trained with different learning rate
type: pipeline

inputs:
  pipeline_job_training_data: 
    type: uri_folder
    path: ./data
  pipeline_job_test_data: 
    type: uri_folder
    path: ./data
  pipeline_job_training_learning_rate1: 0.1
  pipeline_job_training_learning_rate2: 0.01
  compute_train_node: cpu-cluster
  compute_compare_node: cpu-cluster

outputs: 
  pipeline_job_best_model:
    mode: upload
  pipeline_job_best_result:
    mode: upload

settings:
  default_datastore: azureml:workspaceblobstore
  default_compute: azureml:cpu-cluster
  continue_on_step_failure: false

jobs:
  train_and_evaluate_model1:
    type: pipeline
    component: ./components/train_pipeline_component.yml
    inputs:
      training_data: ${{parent.inputs.pipeline_job_training_data}}
      test_data: ${{parent.inputs.pipeline_job_test_data}}
      training_max_epochs: 20
      training_learning_rate: ${{parent.inputs.pipeline_job_training_learning_rate1}}
      train_node_compute: ${{parent.inputs.compute_train_node}}

  train_and_evaluate_model2:
    type: pipeline
    component: ./components/train_pipeline_component.yml
    inputs:
      training_data: ${{parent.inputs.pipeline_job_training_data}}
      test_data: ${{parent.inputs.pipeline_job_test_data}}
      training_max_epochs: 20
      training_learning_rate: ${{parent.inputs.pipeline_job_training_learning_rate2}}
      train_node_compute: ${{parent.inputs.compute_train_node}}

  compare:
    type: command
    component: ./components/compare2/compare2.yml
    compute: ${{parent.inputs.compute_compare_node}} # example to show how to promote compute as pipeline level inputs
    inputs:
      model1: ${{parent.jobs.train_and_evaluate_model1.outputs.trained_model}}
      eval_result1: ${{parent.jobs.train_and_evaluate_model1.outputs.evaluation_report}}
      model2: ${{parent.jobs.train_and_evaluate_model2.outputs.trained_model}}
      eval_result2: ${{parent.jobs.train_and_evaluate_model2.outputs.evaluation_report}}
    outputs: 
      best_model: ${{parent.outputs.pipeline_job_best_model}}
      best_result: ${{parent.outputs.pipeline_job_best_result}}

Note

To share or reuse components across jobs in the workspace, you need to register the components. You can use az ml component create to register pipeline components.

You can find other Azure CLI pipeline component-related examples and information at pipelines-with-components in the Azure Machine Learning examples repository.

You can use the pipeline component as a step like other components in the pipeline job.

# Construct pipeline
@pipeline
def pipeline_with_pipeline_component(
    training_input,
    test_input,
    compute_train_node,
    training_learning_rate1=0.1,
    training_learning_rate2=0.01,
):
    # Create two training pipeline component with different learning rate
    # Use anonymous pipeline function for step1
    train_and_evaluate_model1 = train_pipeline_component(
        training_input=training_input,
        test_input=test_input,
        training_learning_rate=training_learning_rate1,
        train_compute=compute_train_node,
    )
    # Use registered pipeline function for step2
    train_and_evaluate_model2 = registered_pipeline_component(
        training_input=training_input,
        test_input=test_input,
        training_learning_rate=training_learning_rate2,
        train_compute=compute_train_node,
    )

    compare2_models = compare2(
        model1=train_and_evaluate_model1.outputs.trained_model,
        eval_result1=train_and_evaluate_model1.outputs.evaluation_report,
        model2=train_and_evaluate_model2.outputs.trained_model,
        eval_result2=train_and_evaluate_model2.outputs.evaluation_report,
    )
    # Return: pipeline outputs
    return {
        "best_model": compare2_models.outputs.best_model,
        "best_result": compare2_models.outputs.best_result,
    }


pipeline_job = pipeline_with_pipeline_component(
    training_input=Input(type="uri_folder", path="./data/"),
    test_input=Input(type="uri_folder", path="./data/"),
    compute_train_node="cpu-cluster",
)

# set pipeline level compute
pipeline_job.settings.default_compute = "cpu-cluster"

Note

To share or reuse components across jobs in the workspace, you need to register the components. You can use ml_client.components.create_or_update to register pipeline components.

You can find other Python SDK v2 pipeline component-related notebooks and information at Pipeline component in the Azure Machine Learning examples GitHub repository.

Share via

Use multistep pipeline components in pipeline jobs

Prerequisites

Build pipeline jobs with pipeline components

Define pipeline components

Use components in pipelines

Related content

Feedback

Additional resources