ParallelRunStep Class
Creates an Azure Machine Learning Pipeline step to process large amounts of data asynchronously and in parallel.
Note
This package, azureml-contrib-pipeline-steps, has been deprecated and moved to azureml-pipeline-steps.
Please use the ParallelRunStep class from the new package.
For an example of using ParallelRunStep, see the notebook at https://aka.ms/batch-inference-notebooks.
For a troubleshooting guide and further references, see https://aka.ms/prstsg.
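Migrating is typically just a change of import path; a sketch, assuming the azureml-pipeline-steps package is installed:

# Deprecated import path:
#   from azureml.contrib.pipeline.steps import ParallelRunStep, ParallelRunConfig
# Replacement import path:
from azureml.pipeline.steps import ParallelRunStep, ParallelRunConfig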
- Inheritance
  - azureml.pipeline.core._python_script_step_base._PythonScriptStepBase
    - ParallelRunStep
Constructor
ParallelRunStep(name, parallel_run_config, inputs, output=None, side_inputs=None, models=None, arguments=None, allow_reuse=True, tags=None, properties=None, add_parallel_run_step_dependencies=True)
Parameters
Name | Description
---|---
name | Required. The name of the step. Must be unique within the workspace, consist only of lowercase letters, numbers, or dashes, start with a letter, and be between 3 and 32 characters long.
parallel_run_config | Required. A ParallelRunConfig object used to determine required run properties.
inputs | Required. A list of input datasets. All datasets in the list must be of the same type.
output | Output port binding, which may be used by later pipeline steps. Default value: None.
side_inputs | A list of side-input reference data. Default value: None.
models | [Deprecated] A list of zero or more model objects, used only to track the pipeline-to-model-version mapping; models are not copied to the container. Use the get_model_path method of the Model class to retrieve a model in the init function of entry_script. Default value: None.
arguments | A list of command-line arguments to pass to the Python entry_script. Default value: None.
allow_reuse | Whether the step should reuse previous results when run with the same settings and inputs. If False, a new run is always generated for this step during pipeline execution. Default value: True.
tags | [Deprecated] A dictionary of key/value tags for this step. Default value: None.
properties | [Deprecated] A dictionary of key/value properties for this step. Default value: None.
add_parallel_run_step_dependencies | [Deprecated] Whether to add runtime dependencies for ParallelRunStep. Default value: True.
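As the models parameter notes, models attached to the step are only tracked, not copied into the container; the entry script loads them itself in its init function. A minimal sketch of an entry script, assuming a scikit-learn model registered under the hypothetical name "mnist" and a TabularDataset input:

import joblib
from azureml.core.model import Model


def init():
    # Runs once per worker process before any mini-batches arrive.
    global model
    # Resolve the local path of the registered model made available to the run.
    model_path = Model.get_model_path("mnist")  # hypothetical model name
    model = joblib.load(model_path)


def run(mini_batch):
    # Runs once per mini-batch. For a TabularDataset input, mini_batch is a
    # pandas DataFrame; for a FileDataset it is a list of file paths.
    mini_batch["prediction"] = model.predict(mini_batch.values)
    # With output_action="append_row", rows returned from all workers are
    # concatenated into a single output file.
    return mini_batch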
Remarks
The ParallelRunStep class can be used for any kind of processing job that involves large amounts of data and is not time-sensitive, such as batch training or batch scoring. The ParallelRunStep works by breaking up a large job into batches that are processed in parallel. The batch size and degree of parallel processing can be controlled with the ParallelRunConfig class. ParallelRunStep can work with either TabularDataset or FileDataset as input.
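For example, a FileDataset input such as the named_mnist_ds used in the example below might be created like this (a sketch; the workspace object ws, the datastore path, and the input name are illustrative):

from azureml.core import Dataset

# Assumes a Workspace object `ws` whose default datastore holds the input files.
datastore = ws.get_default_datastore()
mnist_ds = Dataset.File.from_files(path=(datastore, "mnist/*.png"))
named_mnist_ds = mnist_ds.as_named_input("mnist_files")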
To work with the ParallelRunStep class, the following pattern is typical:

1. Create a ParallelRunConfig object to specify how batch processing is performed, with parameters to control batch size, the number of nodes per compute target, and a reference to your custom Python script.
2. Create a ParallelRunStep object that uses the ParallelRunConfig object and defines the inputs, outputs, and models for the step.
3. Use the configured ParallelRunStep object in a Pipeline just as you would with pipeline step types defined in the steps package (a submission sketch follows the example below).
Examples of working with ParallelRunStep and ParallelRunConfig classes for batch inference are discussed in the following articles:
- Tutorial: Build an Azure Machine Learning pipeline for batch scoring. This article shows how to use these two classes for asynchronous batch scoring in a pipeline and enable a REST endpoint to run the pipeline.
- Run batch inference on large amounts of data by using Azure Machine Learning. This article shows how to process large amounts of data asynchronously and in parallel with a custom inference script and a pre-trained image classification model based on the MNIST dataset.
from azureml.contrib.pipeline.steps import ParallelRunStep, ParallelRunConfig

parallel_run_config = ParallelRunConfig(
    source_directory=scripts_folder,
    entry_script=script_file,
    mini_batch_size="5",
    error_threshold=10,
    output_action="append_row",
    environment=batch_env,
    compute_target=compute_target,
    node_count=2)

parallelrun_step = ParallelRunStep(
    name="predict-digits-mnist",
    parallel_run_config=parallel_run_config,
    inputs=[named_mnist_ds],
    output=output_dir,
    models=[model],
    arguments=[],
    allow_reuse=True
)
For more information about this example, see the notebook https://aka.ms/batch-inference-notebooks.
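To run the step, place it in a Pipeline and submit the pipeline as an experiment; a minimal sketch, assuming a Workspace object ws and an illustrative experiment name:

from azureml.core import Experiment
from azureml.pipeline.core import Pipeline

# Build a one-step pipeline from the configured step and submit it.
pipeline = Pipeline(workspace=ws, steps=[parallelrun_step])
pipeline_run = Experiment(ws, "batch-scoring-mnist").submit(pipeline)
pipeline_run.wait_for_completion(show_output=True)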
Methods
Method | Description
---|---
create_module_def | Create the module definition object that describes the step. This method is not intended to be used directly.
create_node | Create a node for PythonScriptStep and add it to the specified graph. This method is not intended to be used directly. When a pipeline is instantiated with ParallelRunStep, Azure Machine Learning automatically passes the required parameters through this method so that the step can be added to a pipeline graph that represents the workflow.
create_module_def
Create the module definition object that describes the step.
This method is not intended to be used directly.
create_module_def(execution_type, input_bindings, output_bindings, param_defs=None, create_sequencing_ports=True, allow_reuse=True, version=None, arguments=None)
Parameters
Name | Description
---|---
execution_type | Required. The execution type of the module.
input_bindings | Required. The step input bindings.
output_bindings | Required. The step output bindings.
param_defs | The step parameter definitions. Default value: None.
create_sequencing_ports | If True, sequencing ports will be created for the module. Default value: True.
allow_reuse | If True, the module will be available for reuse in future pipelines. Default value: True.
version | The version of the module. Default value: None.
arguments | Annotated arguments list to use when calling this module. Default value: None.
Returns
Type | Description
---|---
ModuleDef | The module definition object.
create_node
Create a node for PythonScriptStep and add it to the specified graph.
This method is not intended to be used directly. When a pipeline is instantiated with ParallelRunStep, Azure Machine Learning automatically passes the parameters required through this method so that the step can be added to a pipeline graph that represents the workflow.
create_node(graph, default_datastore, context)
Parameters
Name | Description
---|---
graph | Required. The graph object.
default_datastore | Required. The default datastore.
context | Required. The graph context (azureml.pipeline.core._GraphContext).
Returns
Type | Description
---|---
Node | The created node.