ParallelRunConfig Class

Defines configuration for a ParallelRunStep object.

Note

This package, azureml-contrib-pipeline-steps, has been deprecated and moved to azureml-pipeline-steps.

Please use the ParallelRunConfig class from the new package.

For an example of using ParallelRunStep, see the notebook https://aka.ms/batch-inference-notebooks.

For a troubleshooting guide, see https://aka.ms/prstsg. You can find more references there.

Initialize the config object.

Inheritance: builtins.object -> ParallelRunConfig

Constructor

ParallelRunConfig(environment, entry_script, error_threshold, output_action, compute_target, node_count, process_count_per_node=None, mini_batch_size=None, source_directory=None, description=None, logging_level=None, run_invocation_timeout=None, input_format=None, append_row_file_name=None)

Parameters

Name	Description
environment Required	Environment The environment definition that configures the Python environment. It can be configured to use an existing Python environment or to set up a temp environment for the experiment. The definition is also responsible for setting the required application dependencies.
entry_script Required	str The user script that will be run in parallel on multiple nodes. This is specified as a local file path. If `source_directory` is specified, then `entry_script` is a relative path inside that directory. Otherwise, it can be any path accessible on the machine.
error_threshold Required	int The number of record failures for TabularDataset and file failures for FileDataset that should be ignored during processing. If the error count goes above this value, then the job will be aborted. The error threshold applies to the entire input and not to individual mini-batches sent to the run() method. The range is [-1, int.max]. A value of -1 means all failures are ignored during processing.
output_action Required	str How the output is to be organized. Currently supported values are 'append_row' and 'summary_only'. 'append_row' – All values output by run() method invocations will be aggregated into one unique file named parallel_run_step.txt that is created in the output location. 'summary_only' – User script is expected to store the output by itself. An output row is still expected for each successful input item processed. The system uses this output only for error threshold calculation (ignoring the actual value of the row).
compute_target Required	AmlCompute or str Compute target to use for ParallelRunStep. This parameter may be specified as a compute target object or the string name of a compute target in the workspace.
node_count Required	int Number of nodes in the compute target used for running the ParallelRunStep.
process_count_per_node	int Number of processes executed on each node. (optional, default value is the number of cores on the node.) default value: None
mini_batch_size	str For FileDataset input, this field is the number of files the user script can process in one run() call. For TabularDataset input, this field is the approximate size of data the user script can process in one run() call. Example values are 1024, 1024KB, 10MB, and 1GB. (optional, default value is 10 files for FileDataset and 1MB for TabularDataset.) default value: None
source_directory	str Path to the folder that contains the `entry_script` and supporting files used to execute on the compute target. default value: None
description	str A description to give the batch service used for display purposes. default value: None
logging_level	str A string of the logging level name, which is defined in 'logging'. Possible values are 'WARNING', 'INFO', and 'DEBUG'. (optional, default value is 'INFO'.) default value: None
run_invocation_timeout	int Timeout in seconds for each invocation of the run() method. (optional, default value is 60.) default value: None
input_format	str Deprecated. default value: None
append_row_file_name	str The name of the output file when `output_action` is 'append_row'. default value: None
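
The `error_threshold` semantics described above can be sketched in a few lines. This is only an illustration of the documented rule (abort once the failure count for the entire input goes above the threshold, with -1 meaning failures are never counted against the job). The function name `should_abort` is hypothetical and is not part of the azureml API.

```python
def should_abort(failed_items: int, error_threshold: int) -> bool:
    """Return True if a job with this many failed items would exceed the threshold.

    Hypothetical helper illustrating the documented rule; not an azureml function.
    """
    if error_threshold < -1:
        raise ValueError("error_threshold must be in the range [-1, int.max]")
    if error_threshold == -1:
        # -1 means all failures are ignored during processing.
        return False
    # The count is for the entire input, not per mini-batch.
    return failed_items > error_threshold
```

With `error_threshold=10`, ten failures are tolerated and the eleventh would abort the job.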

Remarks

The ParallelRunConfig class is used to specify configuration for the ParallelRunStep class. The ParallelRunConfig and ParallelRunStep classes together can be used for any kind of processing job that involves large amounts of data and is not time-sensitive, such as training or scoring. The ParallelRunStep works by breaking up a large job into batches that are processed in parallel. The batch size and degree of parallel processing can be controlled with the ParallelRunConfig class. ParallelRunStep can work with either TabularDataset or FileDataset as input.
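
As a rough illustration of the batching idea, the sketch below shows the general shape of an entry script for FileDataset input, plus a local stand-in that splits a file list into mini-batches and calls run() on each. It assumes the usual entry-script contract of an init() function for one-time setup and a run(mini_batch) function that returns one output row per successfully processed item. The `model` placeholder and the `simulate_locally` helper are hypothetical and exist only for this example; the actual scheduling across nodes and processes is handled by the service and is not reproduced here.

```python
import os

model = None  # populated by init()


def init():
    """One-time setup, such as loading a model (assumed entry-script contract)."""
    global model
    # Hypothetical stand-in for a real model: scores a file by its name length.
    model = lambda path: len(os.path.basename(path))


def run(mini_batch):
    """Process one mini-batch; for FileDataset input this is a list of file paths."""
    # Return one output row per successfully processed input item.
    return [f"{os.path.basename(p)}: {model(p)}" for p in mini_batch]


def simulate_locally(files, mini_batch_size):
    """Local stand-in for the service: split files into mini-batches, call run() on each."""
    init()
    rows = []
    for start in range(0, len(files), mini_batch_size):
        rows.extend(run(files[start:start + mini_batch_size]))
    return rows
```

Calling `simulate_locally(file_paths, 5)` mimics a `mini_batch_size` of "5" for FileDataset input, sequentially and on a single machine.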

To work with the ParallelRunStep class, the following pattern is typical:

1. Create a ParallelRunConfig object to specify how batch processing is performed, with parameters to control batch size, number of nodes per compute target, and a reference to your custom Python script.
2. Create a ParallelRunStep object that uses the ParallelRunConfig object, defines inputs and outputs for the step, and lists the models to use.
3. Use the configured ParallelRunStep object in a Pipeline just as you would with pipeline step types defined in the steps package.

Examples of working with ParallelRunStep and ParallelRunConfig classes for batch inference are discussed in the following articles:

Tutorial: Build an Azure Machine Learning pipeline for batch scoring. This article shows how to use these two classes for asynchronous batch scoring in a pipeline and enable a REST endpoint to run the pipeline.
Run batch inference on large amounts of data by using Azure Machine Learning. This article shows how to process large amounts of data asynchronously and in parallel with a custom inference script and a pre-trained image classification model based on the MNIST dataset.


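   from azureml.contrib.pipeline.steps import ParallelRunStep, ParallelRunConfig

   # scripts_folder, script_file, batch_env, compute_target, named_mnist_ds,
   # output_dir, and model are assumed to be defined earlier (see the notebook).
   parallel_run_config = ParallelRunConfig(
       source_directory=scripts_folder,
       entry_script=script_file,        # relative path inside source_directory
       mini_batch_size="5",             # FileDataset input: files per run() call
       error_threshold=10,              # abort if more than 10 items fail
       output_action="append_row",      # aggregate run() output into one file
       environment=batch_env,
       compute_target=compute_target,
       node_count=2)

   parallelrun_step = ParallelRunStep(
       name="predict-digits-mnist",
       parallel_run_config=parallel_run_config,
       inputs=[named_mnist_ds],
       output=output_dir,
       models=[model],
       arguments=[],
       allow_reuse=True
   )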

For more information about this example, see the notebook https://aka.ms/batch-inference-notebooks.

Methods

load_yaml	Load parallel run configuration data from a YAML file.
save_to_yaml	Export parallel run configuration data to a YAML file.

load_yaml

Load parallel run configuration data from a YAML file.

static load_yaml(workspace, path)

Parameters

Name	Description
workspace Required	Workspace The workspace to read the configuration data from.
path Required	str The path to load the configuration from.

save_to_yaml

Export parallel run configuration data to a YAML file.

save_to_yaml(path)

Parameters

Name	Description
path Required	str The path to save the file to.
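
The two methods can be combined to persist a configuration and load it back. The helper below is a minimal sketch that relies only on the signatures listed above, `save_to_yaml(path)` on an instance and the static `load_yaml(workspace, path)` on the class. The name `round_trip_yaml` is hypothetical, and the class is passed in as an argument so that the helper does not depend on which package ParallelRunConfig is imported from.

```python
def round_trip_yaml(config, config_cls, workspace, path):
    """Save `config` to `path`, then reload it via `config_cls.load_yaml`.

    Hypothetical helper; `config_cls` is expected to be ParallelRunConfig
    and `workspace` a Workspace object.
    """
    # Export the configuration to a YAML file.
    config.save_to_yaml(path)
    # Load a new configuration object from the same file.
    return config_cls.load_yaml(workspace, path)
```

For example, `round_trip_yaml(parallel_run_config, ParallelRunConfig, ws, "parallel_run_config.yml")` would return a new configuration object, where `ws` and the file name are placeholders.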
