CLI (v2) Automated ML Forecasting command job YAML schema
APPLIES TO: Azure CLI ml extension v2 (current)
The source JSON schema can be found at https://azuremlschemas.azureedge.net/latest/autoMLForecastingJob.schema.json
Note
The YAML syntax detailed in this document is based on the JSON schema for the latest version of the ML CLI v2 extension. This syntax is guaranteed only to work with the latest version of the ML CLI v2 extension. You can find the schemas for older extension versions at https://azuremlschemasprod.azureedge.net/.
YAML syntax
Key | Type | Description | Allowed values | Default value |
---|---|---|---|---|
$schema | string | The location/URL from which to load the YAML schema. If the user uses the Azure Machine Learning VS Code extension to author the YAML file, including $schema at the top of the file enables the user to invoke schema and resource completions. | | |
compute | string | Required. The name of the Azure Machine Learning compute infrastructure to execute the job on. The compute can be either a reference to an existing compute target in the workspace or 'local' to designate local execution. Note: jobs in a pipeline don't support 'local' as compute. | 1. The pattern azureml:<compute_name> to use existing compute, 2. 'local' to use local execution | 'local' |
limits | object | A dictionary object of limit configurations for the Automated ML tabular job. The key is the name of the limit within the context of the job, and the value is the limit value. See limits for the properties of this object. | | |
name | string | The name of the submitted Automated ML job. It must be unique across all jobs in the workspace. If not specified, Azure Machine Learning autogenerates a GUID for the name. | | |
description | string | The description of the Automated ML job. | | |
display_name | string | The name of the job that the user wants to display in the studio UI. It can be non-unique within the workspace. If omitted, Azure Machine Learning autogenerates a human-readable adjective-noun identifier for the display name. | | |
experiment_name | string | The name of the experiment. Experiments are records of your ML training jobs on Azure. Experiments contain the results of your runs, along with logs, charts, and graphs. Each job's run record is organized under the corresponding experiment in the studio's "Experiments" tab. | | Name of the working directory in which it was created |
environment_variables | object | A dictionary object of environment variables to set on the process where the command is being executed. | | |
outputs | object | A dictionary of output configurations of the job. The key is a name for the output within the context of the job, and the value is the output configuration. See job outputs for the properties of this object. | | |
log_files | object | A dictionary object containing the logs of an Automated ML job execution. | | |
log_verbosity | string | The level of log verbosity for writing to the log file. The acceptable values are defined in the Python logging library. | 'not_set', 'debug', 'info', 'warning', 'error', 'critical' | 'info' |
type | const | Required. The type of job. | automl | automl |
task | const | Required. The type of Automated ML task to execute. | forecasting | forecasting |
target_column_name | string | Required. The name of the column to be forecasted. The Automated ML job raises an error if it isn't specified. | | |
featurization | object | A dictionary object defining the configuration of custom featurization. If it isn't provided, Automated ML applies automatic featurization. See featurization for the properties of this object. | | |
forecasting | object | A dictionary object defining the settings of the forecasting job. See forecasting for the properties of this object. | | |
n_cross_validations | string or integer | The number of cross validations to perform during model/pipeline selection if validation_data isn't specified. If neither validation_data nor this parameter is provided, or this parameter is set to None, the Automated ML job sets it to auto by default. If distributed_featurization is enabled and validation_data isn't specified, it's set to 2 by default. | 'auto', [int] | None |
primary_metric | string | The metric that Automated ML optimizes for time-series forecasting model selection. If allowed_training_algorithms includes 'tcn_forecaster', Automated ML supports only 'normalized_root_mean_squared_error' and 'normalized_mean_absolute_error' as the primary_metric. | "spearman_correlation", "normalized_root_mean_squared_error", "r2_score", "normalized_mean_absolute_error" | "normalized_root_mean_squared_error" |
training | object | A dictionary object defining the configuration used in model training. See training for the properties of this object. | | |
training_data | object | Required. A dictionary object containing the MLTable configuration defining the training data to be used as input for model training. This data is a subset of the full data and should include both the independent features/columns and the target feature/column. The user can reference a registered MLTable in the workspace using the format '<mltable_name>:<version>' (for example, Input(mltable='my_mltable:1')) or use a local file or folder as an MLTable (for example, Input(mltable=MLTable(local_path="./data"))). If the target feature isn't present in the source file, Automated ML throws an error. See training or validation or test data for the properties of this object. | | |
validation_data | object | A dictionary object containing the MLTable configuration defining the validation data to be used within the Automated ML experiment for cross validation. If this object is provided, it should include both the independent features/columns and the target feature/column. Samples in the training data and validation data can't overlap in a fold. If this object isn't defined, Automated ML uses n_cross_validations to split validation data from the training data defined in the training_data object. See training or validation or test data for the properties of this object. | | |
test_data | object | A dictionary object containing the MLTable configuration defining the test data to be used in a test run for predictions with the best model, evaluating the model with the defined metrics. If this object is provided, it should include only the independent features used in the training data (without the target feature). If it isn't provided, Automated ML uses other built-in methods to suggest the best model to use for inferencing. See training or validation or test data for the properties of this object. | | |
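Putting the top-level keys together, a minimal forecasting job YAML might look like the following sketch. The experiment name, compute target, target column, time column, and data path are hypothetical placeholders:

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/autoMLForecastingJob.schema.json
type: automl
task: forecasting
experiment_name: energy-demand-forecasting   # hypothetical experiment name
compute: azureml:cpu-cluster                 # hypothetical existing compute target
target_column_name: demand                   # hypothetical target column
primary_metric: normalized_root_mean_squared_error
log_verbosity: info
training_data:
  type: mltable
  path: ./training-data                      # hypothetical local MLTable folder
forecasting:
  time_column_name: timestamp                # hypothetical time axis column
  forecast_horizon: 24
```

Because validation_data is omitted here, Automated ML falls back to n_cross_validations ('auto' by default) to carve validation folds out of training_data.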
limits
Key | Type | Description | Allowed values | Default value |
---|---|---|---|---|
enable_early_termination | boolean | Whether to terminate the experiment early if the score doesn't improve after a given number of iterations. In an Automated ML job, no early stopping is applied to the first 20 iterations; the early stopping window starts only after the first 20 iterations. | true, false | true |
max_concurrent_trials | integer | The maximum number of trials (child jobs) executed in parallel. It's highly recommended to set the number of concurrent runs to the number of nodes in the cluster (the AML compute defined in compute). | | 1 |
max_trials | integer | The maximum number of trials an Automated ML job can run with a training algorithm and different combinations of hyperparameters. Its default value is 1000. If enable_early_termination is defined, the number of trials used to run training algorithms can be smaller. | | 1000 |
max_cores_per_trial | integer | The maximum number of cores available to each trial. Its default value is -1, which means all cores are used in the process. | | -1 |
timeout_minutes | integer | The maximum amount of time in minutes that the submitted Automated ML job can take to run. After the specified amount of time, the job is terminated. This timeout includes setup, featurization, training runs, ensembling, and model explainability (if enabled) of all trials. It doesn't include the ensembling and model explainability runs at the end of the process if the job fails to complete within the provided timeout_minutes, since those features become available only once all the trials (child jobs) are done. Its default value is 360 minutes (6 hours). To specify a timeout less than or equal to 1 hour (60 minutes), the user should make sure the dataset's size isn't greater than 10,000,000 (rows times columns), or an error results. | | 360 |
trial_timeout_minutes | integer | The maximum amount of time in minutes that each trial (child job) in the submitted Automated ML job can take to run. After the specified amount of time, the child job is terminated. | | 30 |
exit_score | float | The score to be achieved by the experiment. The experiment terminates after the specified score is reached. If not specified (no criteria), the experiment runs until no further progress is made on the defined primary metric. | | |
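As a sketch, a limits block combining these keys might look like this (the values are illustrative, not recommendations):

```yaml
limits:
  timeout_minutes: 120            # whole-job cap, including setup and featurization
  trial_timeout_minutes: 15       # per-trial (child job) cap
  max_trials: 50
  max_concurrent_trials: 4        # ideally matches the node count of the compute cluster
  max_cores_per_trial: -1         # use all available cores for each trial
  enable_early_termination: true
```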
forecasting
Key | Type | Description | Allowed values | Default value |
---|---|---|---|---|
time_column_name | string | Required. The name of the column in the dataset that corresponds to the time axis of each time series. The input dataset for training, validation, or test must contain this column if the task is forecasting. If not provided or set to None, the Automated ML forecasting job throws an error and terminates the experiment. | | |
forecast_horizon | string or integer | The maximum forecast horizon in units of the time-series frequency. These units are based on the time interval inferred from your training data (for example, monthly or weekly) that the forecaster uses to predict. If set to None or 'auto', its default value is 1, meaning 't+1' from the last timestamp t in the input data. | 'auto', [int] | 1 |
frequency | string | The frequency at which forecast generation is desired, for example daily, weekly, or yearly. If it isn't specified or is set to None, its default value is inferred from the dataset's time index. The user can set its value greater than the dataset's inferred frequency, but not less. For example, if the dataset's frequency is daily, it can take values like daily, weekly, or monthly, but not hourly, as hourly is finer than daily (24 hours). Refer to the pandas documentation for more information. | | None |
time_series_id_column_names | string or list(strings) | The names of the columns in the data used to group the data into multiple time series. If time_series_id_column_names isn't defined or is set to None, Automated ML uses auto-detection logic to detect the columns. | | None |
feature_lags | string | Whether to generate lags automatically for the provided numeric features. 'auto' means that Automated ML uses autocorrelation-based heuristics to automatically select lag orders and generate corresponding lag features for all numeric features. None means no lags are generated for any numeric features. | 'auto', None | None |
country_or_region_for_holidays | string | The country or region to be used to generate holiday features, represented as an ISO 3166 two-letter country/region code, for example 'US' or 'GB'. The list of ISO codes can be found at https://wikipedia.org/wiki/List_of_ISO_3166_country_codes. | | None |
cv_step_size | string or integer | The number of periods between the origin_time of one CV fold and the next fold. For example, if it is set to 3 for daily data, the origin time for each fold is three days apart. If it is set to None or not specified, it's set to 'auto' by default. If it is an integer, the minimum value it can take is 1; otherwise an error is raised. | 'auto', [int] | 'auto' |
seasonality | string or integer | The time-series seasonality as an integer multiple of the series frequency. If seasonality isn't specified, its value is set to 'auto', meaning it's inferred automatically by Automated ML. If this parameter is set to None, Automated ML assumes the time series is non-seasonal, which is equivalent to setting it to the integer value 1. | 'auto', [int] | 'auto' |
short_series_handling_config | string | How Automated ML should handle short time series. 'auto': short series are padded if there are no long series; otherwise, short series are dropped. 'pad': all short series are padded. 'drop': all short series are dropped. None: short series aren't modified. | 'auto', 'pad', 'drop', None | 'auto' |
target_aggregate_function | string | The aggregate function used to aggregate the target column in the time series and generate the forecasts at the specified frequency (defined in frequency). If this parameter is set but the frequency parameter isn't, an error is raised. If it is omitted or set to None, no aggregation is applied. | 'sum', 'max', 'min', 'mean' | 'auto' |
target_lags | string or integer or list(integer) | The number of past/historical periods to use to lag the target values, based on the dataset frequency. By default, this parameter is turned off. The 'auto' setting lets the system use automatic heuristic-based lags. This property should be used when the relationship between the independent variables and the dependent variable doesn't correlate by default. For more information, see Lagged features for time series forecasting in Automated ML. | 'auto', [int] | None |
target_rolling_window_size | string or integer | The number of past observations to use for creating a rolling-window average of the target column. When forecasting, this parameter represents the n historical periods to use to generate forecasted values, <= training set size. If omitted, n is the full training set size. Specify this parameter when you only want to consider a certain amount of history when training the model. | 'auto', [int], None | None |
use_stl | string | The components to generate by applying STL decomposition on the time series. If not provided or set to None, no time-series component is generated. use_stl can take two values: 'season', to generate the season component, and 'season_trend', to generate both season and trend components. | 'season', 'season_trend' | None |
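The forecasting keys above can be combined into a block like the following sketch. The time column, horizon, and grouping column are hypothetical placeholders:

```yaml
forecasting:
  time_column_name: timestamp                 # hypothetical time axis column
  forecast_horizon: 24                        # predict 24 periods past the last timestamp
  time_series_id_column_names: ["store_id"]   # hypothetical grouping column
  target_lags: auto
  target_rolling_window_size: auto
  short_series_handling_config: auto
  country_or_region_for_holidays: US
```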
training or validation or test data
Key | Type | Description | Allowed values | Default value |
---|---|---|---|---|
datastore | string | The name of the datastore where the data is uploaded by the user. | | |
path | string | The path from which the data should be loaded. It can be a file path, a folder path, or a pattern for paths. A pattern specifies a search pattern to allow globbing (* and **) of files and folders containing data. Supported URI types are azureml, https, wasbs, abfss, and adl. For more information on how to use the azureml:// URI format, see Core yaml syntax. If the URI of the location of the artifact file doesn't have a scheme (for example, http:, azureml:), it's considered a local reference, and the file it points to is uploaded to the default workspace blob storage as the entity is created. | | |
type | const | The type of input data. To generate computer vision models, the user needs to bring labeled image data as input for model training in the form of an MLTable. | mltable | mltable |
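A sketch of training_data and validation_data blocks, one referencing a hypothetical registered MLTable by name and version and the other a hypothetical local folder:

```yaml
training_data:
  type: mltable
  path: azureml:my-training-mltable:1   # hypothetical registered MLTable (name:version)
validation_data:
  type: mltable
  path: ./validation-data               # hypothetical local MLTable folder
```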
training
Key | Type | Description | Allowed values | Default value |
---|---|---|---|---|
allowed_training_algorithms | list(string) | A list of time-series forecasting algorithms to try out as base models for model training in an experiment. If it is omitted or set to None, all supported algorithms are used during the experiment, except the algorithms specified in blocked_training_algorithms. | 'auto_arima', 'prophet', 'naive', 'seasonal_naive', 'average', 'seasonal_average', 'exponential_smoothing', 'arimax', 'tcn_forecaster', 'elastic_net', 'gradient_boosting', 'decision_tree', 'knn', 'lasso_lars', 'sgd', 'random_forest', 'extreme_random_trees', 'light_gbm', 'xg_boost_regressor' | None |
blocked_training_algorithms | list(string) | A list of time-series forecasting algorithms not to run as base models during model training in an experiment. If it is omitted or set to None, all supported algorithms are used during model training. | 'auto_arima', 'prophet', 'naive', 'seasonal_naive', 'average', 'seasonal_average', 'exponential_smoothing', 'arimax', 'tcn_forecaster', 'elastic_net', 'gradient_boosting', 'decision_tree', 'knn', 'lasso_lars', 'sgd', 'random_forest', 'extreme_random_trees', 'light_gbm', 'xg_boost_regressor' | None |
enable_dnn_training | boolean | A flag to turn on or off the inclusion of DNN-based models to try out during model selection. | true, false | false |
enable_model_explainability | boolean | A flag to turn on model explainability, like feature importance, of the best model as evaluated by the Automated ML system. | true, false | true |
enable_vote_ensemble | boolean | A flag to enable or disable ensembling of some base models using the voting algorithm. For more information about ensembles, see Set up Auto train. | true, false | true |
enable_stack_ensemble | boolean | A flag to enable or disable ensembling of some base models using the stacking algorithm. In forecasting tasks, this flag is turned off by default to avoid the risk of overfitting due to the small training set used in fitting the meta learner. For more information about ensembles, see Set up Auto train. | true, false | false |
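For example, a training block that restricts the search space to a few of the supported algorithms might look like this sketch:

```yaml
training:
  enable_model_explainability: true
  enable_vote_ensemble: true
  enable_stack_ensemble: false      # off by default for forecasting tasks
  allowed_training_algorithms:      # only these base models are tried
    - prophet
    - exponential_smoothing
    - light_gbm
```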
featurization
Key | Type | Description | Allowed values | Default value |
---|---|---|---|---|
mode | string | The featurization mode to be used by the Automated ML job. 'auto' indicates that the featurization step should be done automatically; 'off' indicates no featurization; 'custom' indicates that customized featurization should be used. Note: if the input data is sparse, featurization can't be turned on. | 'auto', 'off', 'custom' | None |
blocked_transformers | list(string) | A list of transformer names to be blocked during the featurization step by Automated ML, if the featurization mode is set to 'custom'. | 'text_target_encoder', 'one_hot_encoder', 'cat_target_encoder', 'tf_idf', 'woe_target_encoder', 'label_encoder', 'word_embedding', 'naive_bayes', 'count_vectorizer', 'hash_one_hot_encoder' | None |
column_name_and_types | object | A dictionary object with column names as keys and the feature types used to update the column purpose as the associated values, if the featurization mode is set to 'custom'. | | |
transformer_params | object | A nested dictionary object with a transformer name as the key and the corresponding customization parameters on dataset columns for featurization as the value, if the featurization mode is set to 'custom'. Forecasting only supports the imputer transformer for customization. See column_transformers to find out how to create customization parameters. | | None |
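A custom featurization block combining these keys might look like the following sketch; the column name and the feature type assigned to it are hypothetical:

```yaml
featurization:
  mode: custom
  blocked_transformers:
    - label_encoder
  column_name_and_types:
    store_id: Categorical    # hypothetical column name and feature type
```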
column_transformers
Key | Type | Description | Allowed values | Default value |
---|---|---|---|---|
fields | list(string) | A list of column names to which the provided transformer_params should be applied. | | |
parameters | object | A dictionary object with 'strategy' as the key and the imputation strategy as the value. More details on how it can be provided are given in the examples here. | | |
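Assuming the imputer transformer accepts a forward-fill strategy, a transformer_params sketch nested under featurization could look like this; the column name and strategy value are assumptions, not confirmed by this schema:

```yaml
featurization:
  mode: custom
  transformer_params:
    imputer:
      - fields: ["demand"]       # hypothetical column to impute
        parameters:
          strategy: ffill        # assumed forward-fill imputation strategy
```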
Job outputs
Key | Type | Description | Allowed values | Default value |
---|---|---|---|---|
type | string | The type of job output. For the default uri_folder type, the output corresponds to a folder. | uri_folder, mlflow_model, custom_model | uri_folder |
mode | string | The mode of how output file(s) are delivered to the destination storage. For read-write mount mode (rw_mount), the output directory is a mounted directory. For upload mode, the files written are uploaded at the end of the job. | rw_mount, upload | rw_mount |
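An outputs block using these keys might look like the following sketch; the output name is a hypothetical placeholder:

```yaml
outputs:
  best_model:            # hypothetical output name
    type: mlflow_model
    mode: rw_mount
```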
How to run forecasting job via CLI
```azurecli
az ml job create --file [YOUR_CLI_YAML_FILE] --workspace-name [YOUR_AZURE_WORKSPACE] --resource-group [YOUR_AZURE_RESOURCE_GROUP] --subscription [YOUR_AZURE_SUBSCRIPTION]
```