ParallelRunStep Class
Creates an Azure Machine Learning Pipeline step to process large amounts of data asynchronously and in parallel.
For an example of using ParallelRunStep, see the notebook https://aka.ms/batch-inference-notebooks.
For troubleshooting guide, see https://aka.ms/prstsg. You can find more references there.
- Inheritance
- azureml.pipeline.core._parallel_run_step_base._ParallelRunStepBase
- ParallelRunStep
Constructor
ParallelRunStep(name, parallel_run_config, inputs, output=None, side_inputs=None, arguments=None, allow_reuse=True)
Parameters
- name
- str
Name of the step. Must be unique to the workspace, only consist of lowercase letters, numbers, or dashes, start with a letter, and be between 3 and 32 characters long.
- parallel_run_config
- ParallelRunConfig
A ParallelRunConfig object used to determine required run properties.
- inputs
- list[Union[DatasetConsumptionConfig, PipelineOutputFileDataset, PipelineOutputTabularDataset]]
List of input datasets. All datasets in the list should be of the same type. Input data will be partitioned for parallel processing: each dataset in the list is partitioned into mini-batches separately, and each mini-batch is treated equally in the parallel processing.
- output
- PipelineData, OutputPortBinding
Output port binding, which may be used by later pipeline steps.
- side_inputs
- list[Union[InputPortBinding, DataReference, PortDataReference, PipelineData, PipelineOutputFileDataset, PipelineOutputTabularDataset, DatasetConsumptionConfig]]
List of side input reference data. Side inputs will not be partitioned as input data.
- arguments
- list[str]
List of command-line arguments to pass to the Python entry_script.
- allow_reuse
- bool
Whether the step should reuse previous results when run with the same settings/inputs. If this is false, a new run will always be generated for this step during pipeline execution.
Remarks
ParallelRunStep can be used to process large amounts of data in parallel. Common use cases are training an ML model or running offline inference to generate predictions on a batch of observations. ParallelRunStep works by breaking up your data into batches that are processed in parallel. The batch size, node count, and other tunable parameters used to speed up parallel processing can be controlled through the ParallelRunConfig class. ParallelRunStep can work with either TabularDataset or FileDataset as input.
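As a conceptual illustration only (this is not the SDK's internal implementation), the partitioning of input data into mini-batches can be sketched as follows; the file names are hypothetical:

```python
def make_mini_batches(items, mini_batch_size):
    """Split a list of items (e.g. file paths from a FileDataset)
    into mini-batches that worker nodes process independently."""
    return [items[i:i + mini_batch_size]
            for i in range(0, len(items), mini_batch_size)]

# 7 input files with a mini-batch size of 3 yield batches of 3, 3, and 1.
files = [f"image_{n}.png" for n in range(7)]
batches = make_mini_batches(files, 3)
```

Each resulting mini-batch is handed to one invocation of the entry script's run() function, and multiple invocations execute concurrently across the configured nodes.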
To use ParallelRunStep:
1. Create a ParallelRunConfig object to specify how batch processing is performed, with parameters to control batch size, number of nodes per compute target, and a reference to your custom Python script.
2. Create a ParallelRunStep object that uses the ParallelRunConfig object; define inputs and outputs for the step.
3. Use the configured ParallelRunStep object in a Pipeline just as you would with other pipeline step types.
Examples of working with ParallelRunStep and ParallelRunConfig classes for batch inference are discussed in the following articles:
Tutorial: Build an Azure Machine Learning pipeline for batch scoring. This article shows how to use these two classes for asynchronous batch scoring in a pipeline and enable a REST endpoint to run the pipeline.
Run batch inference on large amounts of data by using Azure Machine Learning. This article shows how to process large amounts of data asynchronously and in parallel with a custom inference script and a pre-trained image classification model based on the MNIST dataset.
from azureml.pipeline.steps import ParallelRunStep, ParallelRunConfig

parallel_run_config = ParallelRunConfig(
    source_directory=scripts_folder,
    entry_script=script_file,
    mini_batch_size="5",
    error_threshold=10,         # Optional: allowed failed count on mini-batch items
    allowed_failed_count=15,    # Optional: allowed failed count of mini-batches
    allowed_failed_percent=10,  # Optional: allowed failed percent of mini-batches
    output_action="append_row",
    environment=batch_env,
    compute_target=compute_target,
    node_count=2)

parallelrun_step = ParallelRunStep(
    name="predict-digits-mnist",
    parallel_run_config=parallel_run_config,
    inputs=[named_mnist_ds],
    output=output_dir,
    arguments=["--extra_arg", "example_value"],
    allow_reuse=True
)
For more information about this example, see the notebook https://aka.ms/batch-inference-notebooks.
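The script referenced by entry_script must define an init() function and a run(mini_batch) function. A minimal sketch follows; the placeholder lambda stands in for a real trained model, which the actual script would load in init():

```python
# Minimal sketch of a ParallelRunStep entry script.
# The real script referenced by `entry_script` would load a trained
# model in init(); a placeholder function stands in for it here.

model = None

def init():
    # Called once per worker process before any mini-batch is processed.
    # Typically loads the model (e.g. from a registered model path).
    global model
    model = lambda path: f"prediction-for-{path}"  # placeholder "model"

def run(mini_batch):
    # Called once per mini-batch. For a FileDataset input, `mini_batch`
    # is a list of file paths; return one result per input item so that
    # output_action="append_row" can collect the results into one file.
    results = []
    for file_path in mini_batch:
        results.append(model(file_path))
    return results
```

With output_action="append_row", the values returned by each run() invocation are appended to a single aggregated output file in the step's output.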
Methods
create_module_def
Create the module definition object that describes the step. This method is not intended to be used directly.
create_node
Create a node for PythonScriptStep and add it to the specified graph. This method is not intended to be used directly. When a pipeline is instantiated with ParallelRunStep, Azure Machine Learning automatically passes the required parameters through this method so that the step can be added to a pipeline graph that represents the workflow.
create_module_def
Create the module definition object that describes the step.
This method is not intended to be used directly.
create_module_def(execution_type, input_bindings, output_bindings, param_defs=None, create_sequencing_ports=True, allow_reuse=True, version=None, arguments=None)
Parameters
- create_sequencing_ports
- bool
If true, sequencing ports will be created for the module.
- allow_reuse
- bool
If true, the module will be available to be reused in future Pipelines.
Returns
The module def object.
Return type
create_node
Create a node for PythonScriptStep and add it to the specified graph.
This method is not intended to be used directly. When a pipeline is instantiated with ParallelRunStep, Azure Machine Learning automatically passes the required parameters through this method so that the step can be added to a pipeline graph that represents the workflow.
create_node(graph, default_datastore, context)
Parameters
- graph
- Graph
The graph object to add the node to.
- default_datastore
- AbstractAzureStorageDatastore or AzureDataLakeDatastore
Default datastore.
- context
- azureml.pipeline.core._GraphContext
Context.
Returns
The created node.
Return type