CLI (v2) Automated ML Forecasting command job YAML schema

Article
08/28/2024

APPLIES TO: Azure CLI ml extension v2 (current)

The source JSON schema can be found at https://azuremlschemas.azureedge.net/latest/autoMLForecastingJob.schema.json

Note

The YAML syntax detailed in this document is based on the JSON schema for the latest version of the ML CLI v2 extension. This syntax is guaranteed only to work with the latest version of the ML CLI v2 extension. You can find the schemas for older extension versions at https://azuremlschemasprod.azureedge.net/.

YAML syntax

Key	Type	Description	Allowed values	Default value
`$schema`	string	The location/url to load the YAML schema. If the user uses the Azure Machine Learning VS Code extension to author the YAML file, including `$schema` at the top of the file enables the user to invoke schema and resource completions.
`compute`	string	Required. The name of the AML compute infrastructure to execute the job on. The compute can be either a reference to an existing compute machine in the workspace or `'local'` for local execution. Note: jobs in a pipeline don't support `'local'` as `compute`. Here, 'local' refers to the compute instance created in the user's Azure Machine Learning studio workspace.	1. pattern `azureml:<compute_name>` to use existing compute, 2. `'local'` to use local execution	`'local'`
`limits`	object	Represents a dictionary object consisting of limit configurations of the Automated ML tabular job. The key is a name for the limit within the context of the job and the value is the limit value. See limits to find out the properties of this object.
`name`	string	The name of the submitted Automated ML job. It must be unique across all jobs in the workspace. If not specified, Azure Machine Learning autogenerates a GUID for the name.
`description`	string	The description of the Automated ML job.
`display_name`	string	The name of the job to display in the studio UI. It can be non-unique within the workspace. If it's omitted, Azure Machine Learning autogenerates a human-readable adjective-noun identifier for the display name.
`experiment_name`	string	The name of the experiment. Experiments are records of your ML training jobs on Azure. Experiments contain the results of your runs, along with logs, charts, and graphs. Each job's run record is organized under the corresponding experiment in the studio's "Experiments" tab.		Name of the working directory in which it was created
`environment_variables`	object	A dictionary object of environment variables to set on the process where the command is being executed.
`outputs`	object	Represents a dictionary of output configurations of the job. The key is a name for the output within the context of the job and the value is the output configuration. See job output to find out properties of this object.
`log_files`	object	A dictionary object containing logs of an Automated ML job execution.
`log_verbosity`	string	The level of log verbosity for writing to the log file. The acceptable values are defined in the Python logging library.	`'not_set'`, `'debug'`, `'info'`, `'warning'`, `'error'`, `'critical'`	`'info'`
`type`	const	Required. The type of job.	`automl`	`automl`
`task`	const	Required. The type of Automated ML task to execute.	`forecasting`	`forecasting`
`target_column_name`	string	Required. Represents the name of the column to be forecasted. The Automated ML job raises an error if not specified.
`featurization`	object	A dictionary object defining the configuration of custom featurization. If it isn't specified, the Automated ML config applies auto featurization. See featurization to see the properties of this object.
`forecasting`	object	A dictionary object defining the settings of forecasting job. See forecasting to find out the properties of this object.
`n_cross_validations`	string or integer	The number of cross validations to perform during model/pipeline selection if `validation_data` isn't specified. If neither `validation_data` nor this parameter is provided, or this parameter is set to `None`, the Automated ML job sets it to `auto` by default. If `distributed_featurization` is enabled and `validation_data` isn't specified, it's set to 2 by default.	`'auto'`, [int]	`None`
`primary_metric`	string	A metric that Automated ML optimizes for Time Series Forecasting model selection. If `allowed_training_algorithms` includes 'tcn_forecaster', Automated ML only supports 'normalized_root_mean_squared_error' and 'normalized_mean_absolute_error' as `primary_metric`.	`"spearman_correlation"`, `"normalized_root_mean_squared_error"`, `"r2_score"`, `"normalized_mean_absolute_error"`	`"normalized_root_mean_squared_error"`
`training`	object	A dictionary object defining the configuration that is used in model training. Check training to find out the properties of this object.
`training_data`	object	Required. A dictionary object containing the MLTable configuration defining the training data to be used as input for model training. This data is a subset of data and should be composed of both independent features/columns and the target feature/column. The user can use a registered MLTable in the workspace using the format '<name>:<version>' (for example, Input(mltable='my_mltable:1')) or can use a local file or folder as an MLTable (for example, Input(mltable=MLTable(local_path="./data"))). If the target feature isn't present in the source file, Automated ML throws an error. Check training or validation or test data to find out the properties of this object.
`validation_data`	object	A dictionary object containing the MLTable configuration defining validation data to be used within Automated ML experiment for cross validation. It should be composed of both independent features/columns and target feature/column if this object is provided. Samples in training data and validation data can't overlap in a fold. See training or validation or test data to find out the properties of this object. In case this object isn't defined, then Automated ML uses `n_cross_validations` to split validation data from training data defined in `training_data` object.
`test_data`	object	A dictionary object containing the MLTable configuration defining the test data to be used in a test run that generates predictions with the best model and evaluates the model using defined metrics. If this object is provided, it should be composed of only the independent features used in training data (without the target feature). Check training or validation or test data to find out the properties of this object. If it isn't provided, Automated ML uses other built-in methods to suggest the best model to use for inferencing.
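
As a rough illustration of how the required top-level keys fit together, a minimal job file might look like the sketch below. The experiment name, the compute name `cpu-cluster`, the column names `demand` and `timestamp`, and the local MLTable folder path are hypothetical placeholders, not values defined by the schema.

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/autoMLForecastingJob.schema.json
type: automl
task: forecasting

# Hypothetical names; replace with values from your own workspace.
experiment_name: forecasting-example
compute: azureml:cpu-cluster

primary_metric: normalized_root_mean_squared_error
target_column_name: demand          # assumed target column

training_data:
  type: mltable
  path: ./training-mltable-folder   # assumed local folder containing an MLTable file

forecasting:
  time_column_name: timestamp       # assumed time column
  forecast_horizon: 24
```

Because `validation_data` and `n_cross_validations` are both omitted in this sketch, the table above indicates that the job would fall back to `auto` cross validation.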

limits

Key	Type	Description	Allowed values	Default value
`enable_early_termination`	boolean	Represents whether to enable early termination of the experiment if the loss score doesn't improve after 'x' number of iterations. In an Automated ML job, no early stopping is applied on the first 20 iterations. The early stopping window starts only after the first 20 iterations.	`true`, `false`	`true`
`max_concurrent_trials`	integer	The maximum number of trials (child jobs) that would be executed in parallel. It's highly recommended to set the number of concurrent runs to the number of nodes in the cluster (the AML compute defined in `compute`).		`1`
`max_trials`	integer	Represents the maximum number of trials an Automated ML job can try to run a training algorithm with different combination of hyperparameters. Its default value is set to 1000. If `enable_early_termination` is defined, then the number of trials used to run training algorithms can be smaller.		`1000`
`max_cores_per_trial`	integer	Represents the maximum number of cores that are available to be used by each trial. Its default value is set to -1, which means all cores are used in the process.		`-1`
`timeout_minutes`	integer	The maximum amount of time in minutes that the submitted Automated ML job can take to run. After the specified amount of time, the job is terminated. This timeout includes setup, featurization, training runs, ensembling, and model explainability (if provided) of all trials. Note that it doesn't include the ensembling and model explainability runs at the end of the process if the job fails to complete within the provided `timeout_minutes`, since these features become available only once all the trials (child jobs) are done. Its default value is set to 360 minutes (6 hours). To specify a timeout less than or equal to 1 hour (60 minutes), the user should make sure the dataset's size isn't greater than 10,000,000 (rows times columns); otherwise an error results.		`360`
`trial_timeout_minutes`	integer	The maximum amount of time in minutes that each trial (child job) in the submitted Automated ML job can take to run. After the specified amount of time, the child job is terminated.		`30`
`exit_score`	float	The score to achieve by an experiment. The experiment terminates after the specified score is reached. If not specified (no criteria), the experiment runs until no further progress is made on the defined `primary_metric`.
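
The limit settings above could be combined as in the following sketch. The specific numbers are illustrative assumptions (for instance, `max_concurrent_trials: 4` assumes a four-node cluster), not recommended values.

```yaml
limits:
  timeout_minutes: 120           # budget for the whole job, all trials included
  trial_timeout_minutes: 20      # budget for each individual trial (child job)
  max_trials: 50                 # upper bound; early termination may stop sooner
  max_concurrent_trials: 4       # assumed to match the node count of the cluster
  max_cores_per_trial: -1        # -1 means all cores are used
  enable_early_termination: true
```

Keeping `timeout_minutes` above 60 in this sketch sidesteps the dataset size restriction that applies to timeouts of one hour or less.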

forecasting

Key	Type	Description	Allowed values	Default value
`time_column_name`	string	Required. The name of the column in the dataset that corresponds to the time axis of each time series. The input dataset for training, validation, or test must contain this column if the task is `forecasting`. If not provided or set to `None`, the Automated ML forecasting job throws an error and terminates the experiment.
`forecast_horizon`	string or integer	The maximum forecast horizon in units of time-series frequency. These units are based on the inferred time interval of your training data (for example, monthly or weekly) that the forecaster uses to predict. If it is set to None or `auto`, then its default value is set to 1, meaning 't+1' from the last timestamp t in the input data.	`auto`, [int]	`1`
`frequency`	string	The frequency at which the forecast generation is desired, for example daily, weekly, or yearly. If it isn't specified or set to None, then its default value is inferred from the dataset time index. The user can set its value greater than the dataset's inferred frequency, but not less than it. For example, if the dataset's frequency is daily, it can take values like daily, weekly, or monthly, but not hourly, as hourly is less than daily (24 hours). Refer to pandas documentation for more information.		`None`
`time_series_id_column_names`	string or list(strings)	The names of columns in the data to be used to group data into multiple time series. If `time_series_id_column_names` is not defined or set to None, Automated ML uses auto-detection logic to detect the columns.		`None`
`feature_lags`	string	Represents whether the user wants to generate lags automatically for the provided numeric features. Setting it to `auto` means that Automated ML uses autocorrelation-based heuristics to automatically select lag orders and generate corresponding lag features for all numeric features. `None` means no lags are generated for any numeric features.	`'auto'`, `None`	`None`
`country_or_region_for_holidays`	string	The country or region to be used to generate holiday features. The value should be an ISO 3166 two-letter country/region code, for example 'US' or 'GB'. The list of the ISO codes can be found at https://wikipedia.org/wiki/List_of_ISO_3166_country_codes.		`None`
`cv_step_size`	string or integer	The number of periods between the origin_time of one CV fold and the next fold. For example, if it is set to 3 for daily data, the origin time for each fold is three days apart. If it is set to None or not specified, then it's set to `auto` by default. If it is of integer type, the minimum value it can take is 1; otherwise it raises an error.	`auto`, [int]	`auto`
`seasonality`	string or integer	The time series seasonality as an integer multiple of the series frequency. If seasonality is not specified, its value is set to `'auto'`, meaning it is inferred automatically by Automated ML. If this parameter is set to `None`, Automated ML assumes the time series is non-seasonal, which is equivalent to setting it to the integer value 1.	`'auto'`, [int]	`auto`
`short_series_handling_config`	string	Represents how Automated ML should handle short time series if specified. It takes the following values: `'auto'`: short series are padded if there are no long series, otherwise short series are dropped. `'pad'`: all the short series are padded with zeros. `'drop'`: all the short series are dropped. `None`: the short series are not modified.	`'auto'`, `'pad'`, `'drop'`, `None`	`auto`
`target_aggregate_function`	string	Represents the aggregate function to be used to aggregate the target column in time series and generate the forecasts at the specified frequency (defined in `frequency`). If this parameter is set, but the `frequency` parameter is not set, then an error is raised. If it is omitted or set to None, then no aggregation is applied.	`'sum'`, `'max'`, `'min'`, `'mean'`	`None`
`target_lags`	string or integer or list(integer)	The number of past/historical periods to use to lag from the target values based on the dataset frequency. By default, this parameter is turned off. The `'auto'` setting allows the system to use automatic heuristic-based lag. This lag property should be used when the relationship between the independent variables and the dependent variable does not correlate by default. For more information, see Lagged features for time series forecasting in Automated ML.	`'auto'`, [int]	`None`
`target_rolling_window_size`	string or integer	The number of past observations to use for creating a rolling window average of the target column. When forecasting, this parameter represents n historical periods to use to generate forecasted values, <= training set size. If omitted, n is the full training set size. Specify this parameter when you only want to consider a certain amount of history when training the model.	`'auto'`, integer, `None`	`None`
`use_stl`	string	The components to generate by applying STL decomposition on time series. If not provided or set to None, no time series component is generated. `use_stl` can take two values: `'season'`: to generate the season component. `'season_trend'`: to generate both season and trend components.	`'season'`, `'season_trend'`	`None`
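
To show how several of these settings might sit together, here is a hedged sketch of a `forecasting` block for daily data with one series per store. The column names `timestamp` and `store_id` are hypothetical, and the frequency string `D` assumes the pandas offset alias for daily data that the `frequency` row points to.

```yaml
forecasting:
  time_column_name: timestamp        # assumed time column
  time_series_id_column_names:
    - store_id                       # assumed grouping column, one series per store
  forecast_horizon: 14               # 14 periods ahead at the series frequency
  frequency: D                       # assumed pandas offset alias for daily
  target_lags: auto
  target_rolling_window_size: 7
  feature_lags: auto
  country_or_region_for_holidays: US
  cv_step_size: auto
  seasonality: auto
  short_series_handling_config: auto
```

Any key left out of the block takes the default listed in the table above.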

training or validation or test data

Key	Type	Description	Allowed values	Default value
`datastore`	string	The name of the datastore where data is uploaded by user.
`path`	string	The path from where data should be loaded. It can be a `file` path, `folder` path, or `pattern` for paths. `pattern` specifies a search pattern to allow globbing (`*` and `**`) of files and folders containing data. Supported URI types are `azureml`, `https`, `wasbs`, `abfss`, and `adl`. For more information on how to use the `azureml://` URI format, see Core yaml syntax. If this URI doesn't have a scheme (for example, http:, azureml:, etc.), then it's considered a local reference, and the file it points to is uploaded to the default workspace blob storage as the entity is created.
`type`	const	The type of input data. For forecasting, the user needs to bring the input data for model training in the form of an MLTable.	`mltable`	`mltable`
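
The two kinds of `path` described above, a local reference and an `azureml://` URI, could be used as in this sketch. The folder name is hypothetical, and the angle-bracket parts of the datastore URI are placeholders to be filled in with your own datastore name and path.

```yaml
# Local reference (no URI scheme): uploaded to the default workspace
# blob storage when the job is created.
training_data:
  type: mltable
  path: ./training-mltable-folder

# Reference to data that already sits in a workspace datastore.
validation_data:
  type: mltable
  path: azureml://datastores/<datastore_name>/paths/<path_to_mltable_folder>
```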

training

Key	Type	Description	Allowed values	Default value
`allowed_training_algorithms`	list(string)	A list of Time Series Forecasting algorithms to try out as the base model for model training in an experiment. If it is omitted or set to None, then all supported algorithms are used during the experiment, except the algorithms specified in `blocked_training_algorithms`.	`'auto_arima'`, `'prophet'`, `'naive'`, `'seasonal_naive'`, `'average'`, `'seasonal_average'`, `'exponential_smoothing'`, `'arimax'`, `'tcn_forecaster'`, `'elastic_net'`, `'gradient_boosting'`, `'decision_tree'`, `'knn'`, `'lasso_lars'`, `'sgd'`, `'random_forest'`, `'extreme_random_trees'`, `'light_gbm'`, `'xg_boost_regressor'`	`None`
`blocked_training_algorithms`	list(string)	A list of Time Series Forecasting algorithms to exclude as the base model during model training in an experiment. If it is omitted or set to None, then all supported algorithms are used during model training.	`'auto_arima'`, `'prophet'`, `'naive'`, `'seasonal_naive'`, `'average'`, `'seasonal_average'`, `'exponential_smoothing'`, `'arimax'`, `'tcn_forecaster'`, `'elastic_net'`, `'gradient_boosting'`, `'decision_tree'`, `'knn'`, `'lasso_lars'`, `'sgd'`, `'random_forest'`, `'extreme_random_trees'`, `'light_gbm'`, `'xg_boost_regressor'`	`None`
`enable_dnn_training`	boolean	A flag to turn on or off the inclusion of DNN-based models to try out during model selection.	`true`, `false`	`false`
`enable_model_explainability`	boolean	A flag to turn on model explainability, such as feature importance, for the best model evaluated by the Automated ML system.	`true`, `false`	`true`
`enable_vote_ensemble`	boolean	A flag to enable or disable the ensembling of some base models using Voting algorithm. For more information about ensembles, see Set up Auto train.	`true`, `false`	`true`
`enable_stack_ensemble`	boolean	A flag to enable or disable ensembling of some base models using Stacking algorithm. In forecasting tasks, this flag is turned off by default, to avoid risks of overfitting due to small training set used in fitting the meta learner. For more information about ensembles, see Set up Auto train.	`true`, `false`	`false`
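
A `training` block that keeps the documented defaults explicit and excludes two algorithms might look like the following sketch. The choice of `prophet` and `knn` is arbitrary and only illustrates the list syntax.

```yaml
training:
  enable_dnn_training: false
  enable_model_explainability: true
  enable_vote_ensemble: true
  enable_stack_ensemble: false     # off by default for forecasting
  blocked_training_algorithms:     # arbitrary picks from the allowed values
    - prophet
    - knn
```

Using `blocked_training_algorithms` leaves every other supported algorithm eligible, whereas `allowed_training_algorithms` would restrict the search to only the listed ones.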

featurization

Key	Type	Description	Allowed values	Default value
`mode`	string	The featurization mode to be used by the Automated ML job. Setting it to: `'auto'` indicates that the featurization step should be done automatically. `'off'` indicates no featurization. `'custom'` indicates that customized featurization should be used. Note: If the input data is sparse, featurization cannot be turned on.	`'auto'`, `'off'`, `'custom'`	`None`
`blocked_transformers`	list(string)	A list of transformer names to be blocked during featurization step by Automated ML, if featurization `mode` is set to 'custom'.	`'text_target_encoder'`, `'one_hot_encoder'`, `'cat_target_encoder'`, `'tf_idf'`, `'wo_e_target_encoder'`, `'label_encoder'`, `'word_embedding'`, `'naive_bayes'`, `'count_vectorizer'`, `'hash_one_hot_encoder'`	`None`
`column_name_and_types`	object	A dictionary object consisting of column names as dict key and feature types used to update column purpose as associated value, if featurization `mode` is set to 'custom'.
`transformer_params`	object	A nested dictionary object consisting of the transformer name as key and corresponding customization parameters on dataset columns for featurization, if featurization `mode` is set to 'custom'. Forecasting only supports the `imputer` transformer for customization. Check out column_transformers to find out how to create customization parameters.		`None`

column_transformers

Key	Type	Description	Allowed values	Default value
`fields`	list(string)	A list of column names on which provided `transformer_params` should be applied.
`parameters`	object	A dictionary object consisting of 'strategy' as key and the imputation strategy as value. More details on how to provide it are given in examples here.
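
Putting the featurization and column_transformers tables together, a custom featurization block could be sketched as below. The column name `temperature`, the feature type `numeric`, and the strategy value `median` are assumptions for illustration; the tables above don't enumerate the accepted feature types or imputation strategies.

```yaml
featurization:
  mode: custom
  column_name_and_types:
    temperature: numeric           # assumed column name and feature type
  transformer_params:
    imputer:                       # the only transformer forecasting supports here
      - fields: ["temperature"]    # columns the parameters apply to
        parameters:
          strategy: median         # assumed imputation strategy value
```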

Job outputs

Key	Type	Description	Allowed values	Default value
`type`	string	The type of job output. For the default `uri_folder` type, the output corresponds to a folder.	`uri_folder`, `mlflow_model`, `custom_model`	`uri_folder`
`mode`	string	The mode of how output file(s) are delivered to the destination storage. For read-write mount mode (`rw_mount`) the output directory is a mounted directory. For upload mode, the file(s) written are uploaded at the end of the job.	`rw_mount`, `upload`	`rw_mount`
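
An `outputs` entry using the keys above might be written as in this sketch. The output name `best_model` is a hypothetical key chosen for illustration, and combining `mlflow_model` with `rw_mount` is an assumption based only on both appearing in the allowed values.

```yaml
outputs:
  best_model:              # hypothetical output name
    type: mlflow_model
    mode: rw_mount
```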

How to run forecasting job via CLI

az ml job create --file [YOUR_CLI_YAML_FILE] --workspace-name [YOUR_AZURE_WORKSPACE] --resource-group [YOUR_AZURE_RESOURCE_GROUP] --subscription [YOUR_AZURE_SUBSCRIPTION]
