CLI (v2) parallel job YAML schema
APPLIES TO: Azure CLI ml extension v2 (current)
Important
Parallel job can only be used as a single step inside an Azure Machine Learning pipeline job. Thus, there is no source JSON schema for parallel job at this time. This document lists the valid keys and their values when creating a parallel job in a pipeline.
Note
The YAML syntax detailed in this document is based on the JSON schema for the latest version of the ML CLI v2 extension. This syntax is guaranteed only to work with the latest version of the ML CLI v2 extension. You can find the schemas for older extension versions at https://azuremlschemasprod.azureedge.net/.
YAML syntax
| Key | Type | Description | Allowed values | Default value |
| --- | ---- | ----------- | -------------- | ------------- |
| `type` | const | **Required.** The type of job. | `parallel` | |
| `inputs` | object | Dictionary of inputs to the parallel job. The key is a name for the input within the context of the job, and the value is the input value. Inputs can be referenced in the `program_arguments` using the `${{ inputs.<input_name> }}` expression. Parallel job inputs can be referenced by pipeline inputs using the `${{ parent.inputs.<input_name> }}` expression. For how to bind the inputs of a parallel step to the pipeline inputs, see the Expression syntax for binding inputs and outputs between steps in a pipeline job. | | |
| `inputs.<input_name>` | number, integer, boolean, string, or object | One of a literal value (of type number, integer, boolean, or string) or an object containing a job input data specification. | | |
| `outputs` | object | Dictionary of output configurations of the parallel job. The key is a name for the output within the context of the job, and the value is the output configuration. Parallel job outputs can be referenced by pipeline outputs using the `${{ parent.outputs.<output_name> }}` expression. For how to bind the outputs of a parallel step to the pipeline outputs, see the Expression syntax for binding inputs and outputs between steps in a pipeline job. | | |
| `outputs.<output_name>` | object | You can leave the object empty, in which case the output is of type `uri_folder` by default and Azure Machine Learning system-generates an output location based on the following templatized path: `{settings.datastore}/azureml/{job-name}/{output-name}/`. Files are written to the output directory via a read-write mount. To specify a different mode for the output, provide an object containing the job output specification. | | |
| `compute` | string | Name of the compute target to execute the job on. The value can be either a reference to an existing compute in the workspace (using the `azureml:<compute_name>` syntax) or `local` to designate local execution. When using a parallel job in a pipeline, you can leave this setting empty, in which case the compute is auto-selected from the pipeline's `default_compute`. | | `local` |
| `task` | object | **Required.** The template for defining the distributed tasks for the parallel job. See Attributes of the task key. | | |
| `input_data` | object | **Required.** Defines which input data is split into mini-batches to run the parallel job. Only applicable for referencing one of the parallel job inputs by using the `${{ inputs.<input_name> }}` expression. | | |
| `mini_batch_size` | string | Defines the size of each mini-batch to split the input. If `input_data` is a folder or set of files, this number defines the **file count** for each mini-batch, for example, `10`, `100`. If `input_data` is tabular data from `mltable`, this number defines the **approximate physical size** for each mini-batch, for example, `100 kb`, `100 mb`. | | `1` |
| `partition_keys` | list | The keys used to partition the dataset into mini-batches. If specified, data with the same key is partitioned into the same mini-batch. If both `partition_keys` and `mini_batch_size` are specified, the partition keys take effect. See the sketch after this table. | | |
| `mini_batch_error_threshold` | integer | Defines the number of failed mini-batches that can be ignored in this parallel job. If the count of failed mini-batches is higher than this threshold, the parallel job is marked as failed. A mini-batch is marked as failed if: <br> - the count of returns from `run()` is less than the mini-batch input count. <br> - an exception is caught in the custom `run()` code. <br> `-1` is the default, meaning all failed mini-batches are ignored during the parallel job. | `[-1, int.max]` | `-1` |
| `logging_level` | string | Defines which level of logs is dumped to user log files. | `INFO`, `WARNING`, `DEBUG` | `INFO` |
| `resources.instance_count` | integer | The number of nodes to use for the job. | | `1` |
| `max_concurrency_per_instance` | integer | Defines the number of processes on each node of the compute. For a GPU compute, the default value is 1. For a CPU compute, the default value is the number of cores. | | |
| `retry_settings.max_retries` | integer | Defines the number of retries when a mini-batch fails or times out. If all retries fail, the mini-batch is marked as failed and counted toward the `mini_batch_error_threshold` calculation. | | `2` |
| `retry_settings.timeout` | integer | Defines the timeout in seconds for executing the custom `run()` function. If the execution time exceeds this threshold, the mini-batch is aborted and marked as failed to trigger a retry. | `(0, 259200]` | `60` |
| `environment_variables` | object | Dictionary of environment variable key-value pairs to set on the process where the command is executed. | | |
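As a quick illustration of how `partition_keys` replaces `mini_batch_size`, the following hypothetical parallel step (all compute, input, and script names are placeholders) groups rows that share the same key into one mini-batch:

```yaml
# Hypothetical parallel step inside a pipeline job; names are placeholders.
partition_by_store:
  type: parallel
  inputs:
    job_data:
      type: mltable
      path: azureml:my-tabular-data@latest
      mode: direct
  input_data: ${{inputs.job_data}}
  # Rows sharing the same store_id value land in the same mini-batch;
  # partition_keys takes effect if mini_batch_size is also set.
  partition_keys:
    - store_id
  mini_batch_error_threshold: -1   # ignore all failed mini-batches
  resources:
    instance_count: 2
  max_concurrency_per_instance: 2
  task:
    type: run_function
    code: ./src
    entry_script: score.py
    environment: azureml:my-env@latest
```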
Attributes of the task key
| Key | Type | Description | Allowed values | Default value |
| --- | ---- | ----------- | -------------- | ------------- |
| `type` | const | **Required.** The type of task. Currently only `run_function` is supported. <br> In `run_function` mode, you're required to provide `code`, `entry_script`, and `program_arguments` to define a Python script with executable functions and arguments. Note: Parallel job only supports Python scripts in this mode. | `run_function` | `run_function` |
| `code` | string | Local path to the source code directory to be uploaded and used for the job. | | |
| `entry_script` | string | The Python file that contains the implementation of predefined parallel functions. For more information, see Prepare entry script to parallel job. | | |
| `environment` | string or object | **Required.** The environment to use for running the task. The value can be either a reference to an existing versioned environment in the workspace or an inline environment specification. <br> To reference an existing environment, use the `azureml:<environment_name>:<environment_version>` syntax or `azureml:<environment_name>@latest` (to reference the latest version of an environment). <br> To define an inline environment, follow the Environment schema. Exclude the `name` and `version` properties, as they aren't supported for inline environments. | | |
| `program_arguments` | string | The arguments to be passed to the entry script. May contain `--<arg_name> ${{inputs.<input_name>}}` references to inputs or outputs. Parallel job provides a list of predefined arguments to set the configuration of the parallel run. For more information, see Predefined arguments for parallel job. | | |
| `append_row_to` | string | Aggregates all returns from each run of a mini-batch and outputs them into this file. May reference one of the outputs of the parallel job by using the `${{outputs.<output_name>}}` expression. | | |
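For example, here's a minimal sketch of a `task` block that references a registered environment instead of defining one inline; the environment name, script name, and the `--model` argument are assumptions, not part of the schema:

```yaml
task:
  type: run_function
  code: ./src                        # local source folder, uploaded with the job
  entry_script: batch_score.py       # implements the predefined parallel functions
  # Reference the latest version of a registered environment:
  environment: azureml:my-scoring-env@latest
  program_arguments: >-
    --model ${{inputs.score_model}}
    --error_threshold -1
  append_row_to: ${{outputs.scored_data}}
```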
Job inputs
| Key | Type | Description | Allowed values | Default value |
| --- | ---- | ----------- | -------------- | ------------- |
| `type` | string | The type of job input. Specify `mltable` for input data that points to a location that has the mltable meta file, or `uri_folder` for input data that points to a folder source. | `mltable`, `uri_folder` | `uri_folder` |
| `path` | string | The path to the data to use as input. The value can be specified in a few ways: <br> - A local path to the data source file or folder, for example, `path: ./iris.csv`. The data gets uploaded during job submission. <br> - A URI of a cloud path to the file or folder to use as the input. Supported URI types are `azureml`, `https`, `wasbs`, `abfss`, `adl`. For more information on how to use the `azureml://` URI format, see Core yaml syntax. <br> - An existing registered Azure Machine Learning data asset to use as the input. To reference a registered data asset, use the `azureml:<data_name>:<data_version>` syntax or `azureml:<data_name>@latest` (to reference the latest version of that data asset), for example, `path: azureml:cifar10-data:1` or `path: azureml:cifar10-data@latest`. | | |
| `mode` | string | Mode of how the data should be delivered to the compute target. <br> For read-only mount (`ro_mount`), the data is consumed as a mount path. A folder is mounted as a folder and a file is mounted as a file. Azure Machine Learning resolves the input to the mount path. <br> For `download` mode, the data is downloaded to the compute target. Azure Machine Learning resolves the input to the downloaded path. <br> If you only want the URL of the storage location of the data artifact(s) rather than mounting or downloading the data itself, you can use `direct` mode. It passes in the URL of the storage location as the job input. In this case, you're fully responsible for handling credentials to access the storage. | `ro_mount`, `download`, `direct` | `ro_mount` |
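To illustrate the three delivery modes, here's a hypothetical `inputs` block (asset and folder names are placeholders) that mounts one input read-only, downloads another, and passes only the storage URL for a third:

```yaml
inputs:
  lookup_files:
    type: uri_folder
    path: azureml:my-lookup-data:1     # registered data asset, placeholder name
    mode: ro_mount                     # consumed as a read-only mount path
  model_dir:
    type: uri_folder
    path: ./my-model                   # local folder, uploaded at submission
    mode: download                     # copied onto the compute target
  raw_table:
    type: mltable
    path: azureml:my-tabular-data@latest
    mode: direct                       # only the storage URL is passed in
```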
Job outputs
| Key | Type | Description | Allowed values | Default value |
| --- | ---- | ----------- | -------------- | ------------- |
| `type` | string | The type of job output. For the default `uri_folder` type, the output corresponds to a folder. | `uri_folder` | `uri_folder` |
| `mode` | string | Mode of how output file(s) are delivered to the destination storage. For read-write mount mode (`rw_mount`), the output directory is a mounted directory. For `upload` mode, the files written are uploaded at the end of the job. | `rw_mount`, `upload` | `rw_mount` |
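For example, a minimal sketch of an output that uploads its files at job completion instead of using the default read-write mount (the output name is a placeholder):

```yaml
outputs:
  scored_data:
    type: uri_folder   # default type; the output corresponds to a folder
    mode: upload       # files are written locally, then uploaded when the job ends
```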
Predefined arguments for parallel job
| Key | Description | Allowed values | Default value |
| --- | ----------- | -------------- | ------------- |
| `--error_threshold` | The threshold of failed items. Failed items are counted by the number gap between inputs and returns from each mini-batch. If the sum of failed items is higher than this threshold, the parallel job is marked as failed. Note: `-1` is the default, meaning all failures during the parallel job are ignored. | `[-1, int.max]` | `-1` |
| `--allowed_failed_percent` | Similar to `mini_batch_error_threshold`, but uses the percentage of failed mini-batches instead of the count. | `[0, 100]` | `100` |
| `--task_overhead_timeout` | The timeout in seconds for the initialization of each mini-batch, for example, loading mini-batch data and passing it to the `run()` function. | `(0, 259200]` | `30` |
| `--progress_update_timeout` | The timeout in seconds for monitoring the progress of mini-batch execution. If no progress updates are received within this timeout setting, the parallel job is marked as failed. | `(0, 259200]` | Dynamically calculated by other settings. |
| `--first_task_creation_timeout` | The timeout in seconds for monitoring the time between the job start and the run of the first mini-batch. | `(0, 259200]` | `600` |
| `--copy_logs_to_parent` | Boolean option of whether to copy the job progress, overview, and logs to the parent pipeline job. | `True`, `False` | `False` |
| `--metrics_name_prefix` | Provide the custom prefix of your metrics in this parallel job. | | |
| `--push_metrics_to_parent` | Boolean option of whether to push metrics to the parent pipeline job. | `True`, `False` | `False` |
| `--resource_monitor_interval` | The time interval in seconds to dump node resource usage (for example, cpu, memory) to the log folder under the `logs/sys/perf` path. Note: Frequent resource log dumps slightly slow down the execution speed of your mini-batch. Set this value to `0` to stop dumping resource usage. | `[0, int.max]` | `600` |
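These flags are passed through the task's `program_arguments` string. For example, a sketch (the prefix value is a placeholder) that pushes prefixed metrics and logs to the parent pipeline job and turns off resource usage dumps:

```yaml
program_arguments: >-
  --metrics_name_prefix batch_scoring_
  --push_metrics_to_parent True
  --copy_logs_to_parent True
  --resource_monitor_interval 0
```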
Remarks
The `az ml job` commands can be used for managing Azure Machine Learning jobs.
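For example, to submit a pipeline YAML file that contains a parallel step (the file name is a placeholder):

```azurecli
az ml job create --file pipeline.yml
```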
Examples
Examples are available in the examples GitHub repository. One is shown below.
YAML: Using parallel job in pipeline
```yaml
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline

display_name: iris-batch-prediction-using-parallel
description: The hello world pipeline job with inline parallel job
tags:
  tag: tagvalue
  owner: sdkteam

settings:
  default_compute: azureml:cpu-cluster

jobs:
  batch_prediction:
    type: parallel
    compute: azureml:cpu-cluster
    inputs:
      input_data:
        type: mltable
        path: ./neural-iris-mltable
        mode: direct
      score_model:
        type: uri_folder
        path: ./iris-model
        mode: download
    outputs:
      job_output_file:
        type: uri_file
        mode: rw_mount

    input_data: ${{inputs.input_data}}
    mini_batch_size: "10kb"
    resources:
      instance_count: 2
    max_concurrency_per_instance: 2

    logging_level: "DEBUG"
    mini_batch_error_threshold: 5
    retry_settings:
      max_retries: 2
      timeout: 60

    task:
      type: run_function
      code: "./script"
      entry_script: iris_prediction.py
      environment:
        name: "prs-env"
        version: 1
        image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04
        conda_file: ./environment/environment_parallel.yml
      program_arguments: >-
        --model ${{inputs.score_model}}
        --error_threshold 5
        --allowed_failed_percent 30
        --task_overhead_timeout 1200
        --progress_update_timeout 600
        --first_task_creation_timeout 600
        --copy_logs_to_parent True
        --resource_monitor_interval 20
      append_row_to: ${{outputs.job_output_file}}
```