AutoMLConfig Class
Represents configuration for submitting an automated ML experiment in Azure Machine Learning.
This configuration object contains and persists the parameters for configuring the experiment run, as well as the training data to be used at run time. For guidance on selecting your settings, see https://aka.ms/AutoMLConfig.
Create an AutoMLConfig.
- Inheritance
- builtins.object -> AutoMLConfig
Constructor
AutoMLConfig(task: str, path: str | None = None, iterations: int | None = None, primary_metric: str | None = None, positive_label: Any | None = None, compute_target: Any | None = None, spark_context: Any | None = None, X: Any | None = None, y: Any | None = None, sample_weight: Any | None = None, X_valid: Any | None = None, y_valid: Any | None = None, sample_weight_valid: Any | None = None, cv_splits_indices: List[List[Any]] | None = None, validation_size: float | None = None, n_cross_validations: int | str | None = None, y_min: float | None = None, y_max: float | None = None, num_classes: int | None = None, featurization: str | FeaturizationConfig = 'auto', max_cores_per_iteration: int = 1, max_concurrent_iterations: int = 1, iteration_timeout_minutes: int | None = None, mem_in_mb: int | None = None, enforce_time_on_windows: bool = True, experiment_timeout_hours: float | None = None, experiment_exit_score: float | None = None, enable_early_stopping: bool = True, blocked_models: List[str] | None = None, blacklist_models: List[str] | None = None, exclude_nan_labels: bool = True, verbosity: int = 20, enable_tf: bool = False, model_explainability: bool = True, allowed_models: List[str] | None = None, whitelist_models: List[str] | None = None, enable_onnx_compatible_models: bool = False, enable_voting_ensemble: bool = True, enable_stack_ensemble: bool | None = None, debug_log: str = 'automl.log', training_data: Any | None = None, validation_data: Any | None = None, test_data: Any | None = None, test_size: float | None = None, label_column_name: str | None = None, weight_column_name: str | None = None, cv_split_column_names: List[str] | None = None, enable_local_managed: bool = False, enable_dnn: bool | None = None, forecasting_parameters: ForecastingParameters | None = None, **kwargs: Any)
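A minimal sketch of how a handful of these arguments might be assembled before calling the constructor. The keys are parameter names from the signature above; the values and the column name "label" are illustrative placeholders, and the commented-out call assumes the SDK is installed and that a dataset object named train_dataset (hypothetical) already exists.

```python
# Keyword arguments named after parameters in the constructor signature.
# All values are illustrative placeholders, not recommendations.
automl_settings = {
    "task": "classification",
    "primary_metric": "accuracy",
    "label_column_name": "label",  # hypothetical column name
    "n_cross_validations": 5,
    "iterations": 50,
    "iteration_timeout_minutes": 10,
    "experiment_timeout_hours": 2,
    "enable_early_stopping": True,
    "featurization": "auto",
}

# With the SDK installed and a training dataset available, the settings
# would be unpacked into the constructor, for example:
#   from azureml.train.automl import AutoMLConfig
#   config = AutoMLConfig(training_data=train_dataset, **automl_settings)
print(sorted(automl_settings))
```

Keeping the settings in a plain dictionary makes it easy to log or reuse them across experiments.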
Parameters
- task
- str
The type of task to run. Values can be 'classification', 'regression', or 'forecasting' depending on the type of automated ML problem to solve.
- path
- str
The full path to the Azure Machine Learning project folder. If not specified, the default is to use the current directory or ".".
- iterations
- int
The total number of different algorithm and parameter combinations to test during an automated ML experiment. If not specified, the default is 1000 iterations.
- primary_metric
- str
The metric that Automated Machine Learning will optimize for model selection. Automated Machine Learning collects more metrics than it can optimize. You can use get_primary_metrics to get a list of valid metrics for your given task. For more information on how metrics are calculated, see https://docs.microsoft.com/azure/machine-learning/how-to-configure-auto-train#primary-metric.
If not specified, accuracy is used for classification tasks, normalized root mean squared error is used for forecasting and regression tasks, accuracy is used for image classification and image multi-label classification, and mean average precision is used for image object detection.
- positive_label
- Any
The positive class label that Automated Machine Learning will use to calculate binary metrics. Binary metrics are calculated under two conditions for classification tasks:
- The label column consists of two classes, indicating a binary classification task. AutoML uses the specified positive class when positive_label is passed in; otherwise, AutoML picks a positive class based on the label-encoded value.
- A multiclass classification task with positive_label specified.
For more information on classification, check out metrics for classification scenarios.
- compute_target
- AbstractComputeTarget
The Azure Machine Learning compute target to run the Automated Machine Learning experiment on. See https://docs.microsoft.com/en-us/azure/machine-learning/concept-automated-ml#local-remote for more information on compute targets.
- spark_context
- <xref:SparkContext>
The Spark context. Only applicable when used inside Azure Databricks/Spark environment.
- X
- DataFrame or ndarray or Dataset or TabularDataset
The training features to use when fitting pipelines during an experiment. This setting is being deprecated. Please use training_data and label_column_name instead.
- y
- DataFrame or ndarray or Dataset or TabularDataset
The training labels to use when fitting pipelines during an experiment. This is the value your model will predict. This setting is being deprecated. Please use training_data and label_column_name instead.
- sample_weight
- DataFrame or ndarray or TabularDataset
The weight to give to each training sample when running fitting pipelines. Each row should correspond to a row in X and y data.
Specify this parameter when specifying X.
This setting is being deprecated. Please use training_data and weight_column_name instead.
- X_valid
- DataFrame or ndarray or Dataset or TabularDataset
Validation features to use when fitting pipelines during an experiment.
If specified, then y_valid or sample_weight_valid must also be specified.
This setting is being deprecated. Please use validation_data and label_column_name instead.
- y_valid
- DataFrame or ndarray or Dataset or TabularDataset
Validation labels to use when fitting pipelines during an experiment.
Both X_valid and y_valid must be specified together.
This setting is being deprecated. Please use validation_data and label_column_name instead.
- sample_weight_valid
- DataFrame or ndarray or TabularDataset
The weight to give to each validation sample when running scoring pipelines. Each row should correspond to a row in X and y data.
Specify this parameter when specifying X_valid.
This setting is being deprecated. Please use validation_data and weight_column_name instead.
- cv_splits_indices
- List[List[ndarray]]
Indices where to split training data for cross validation. Each row is a separate cross fold. Within each cross fold, provide 2 numpy arrays: the first with the indices of the samples to use for training data and the second with the indices to use for validation data. That is, [[t1, v1], [t2, v2], ...] where t1 is the training indices for the first cross fold and v1 is the validation indices for the first cross fold.
To specify existing data as validation data, use validation_data. To let AutoML extract validation data out of training data instead, specify either n_cross_validations or validation_size.
Use cv_split_column_names if you have cross validation column(s) in training_data.
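The nested [[t1, v1], [t2, v2], ...] layout can be sketched as follows. This is a minimal illustration, not part of the SDK: make_cv_splits_indices is a hypothetical helper, and plain Python lists stand in for the numpy arrays that the parameter expects.

```python
def make_cv_splits_indices(n_rows, n_folds):
    """Return [[train_idx, valid_idx], ...] using contiguous validation folds."""
    fold_size = n_rows // n_folds
    splits = []
    for fold in range(n_folds):
        start = fold * fold_size
        # The last fold absorbs any remainder rows.
        stop = n_rows if fold == n_folds - 1 else start + fold_size
        valid_idx = list(range(start, stop))
        train_idx = [i for i in range(n_rows) if i < start or i >= stop]
        splits.append([train_idx, valid_idx])
    return splits


splits = make_cv_splits_indices(n_rows=6, n_folds=3)
for train_idx, valid_idx in splits:
    print(train_idx, valid_idx)
```

Each inner list would normally be converted to a numpy array (for example with numpy.asarray) before being passed as cv_splits_indices.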
- validation_size
- float
What fraction of the data to hold out for validation when user validation data is not specified. This should be between 0.0 and 1.0 non-inclusive.
Specify validation_data to provide validation data, otherwise set n_cross_validations or validation_size to extract validation data out of the specified training data.
For custom cross validation fold, use cv_split_column_names.
For more information, see Configure data splits and cross-validation in automated machine learning.
- n_cross_validations
- int
How many cross validations to perform when user validation data is not specified.
Specify validation_data to provide validation data, otherwise set n_cross_validations or validation_size to extract validation data out of the specified training data.
For custom cross validation fold, use cv_split_column_names.
For more information, see Configure data splits and cross-validation in automated machine learning.
- y_min
- float
Minimum value of y for a regression experiment. The combination of y_min and y_max is used to normalize test set metrics based on the input data range. This setting is being deprecated. Instead, this value will be computed from the data.
- y_max
- float
Maximum value of y for a regression experiment. The combination of y_min and y_max is used to normalize test set metrics based on the input data range. This setting is being deprecated. Instead, this value will be computed from the data.
- num_classes
- int
The number of classes in the label data for a classification experiment. This setting is being deprecated. Instead, this value will be computed from the data.
- featurization
- str or FeaturizationConfig
'auto' / 'off' / FeaturizationConfig Indicator for whether featurization step should be done automatically or not, or whether customized featurization should be used. Note: If the input data is sparse, featurization cannot be turned on.
Column type is automatically detected. Based on the detected column type preprocessing/featurization is done as follows:
Categorical: Target encoding, one hot encoding, drop high cardinality categories, impute missing values.
Numeric: Impute missing values, cluster distance, weight of evidence.
DateTime: Several features such as day, seconds, minutes, hours etc.
Text: Bag of words, pre-trained Word embedding, text target encoding.
More details can be found in the article Configure automated ML experiments in Python.
To customize featurization step, provide a FeaturizationConfig object. Customized featurization currently supports blocking a set of transformers, updating column purpose, editing transformer parameters, and dropping columns. For more information, see Customize feature engineering.
Note: Timeseries features are handled separately when the task type is set to forecasting independent of this parameter.
- max_cores_per_iteration
- int
The maximum number of threads to use for a given training iteration. Acceptable values:
Greater than 1 and less than or equal to the maximum number of cores on the compute target.
Equal to -1, which means to use all the possible cores per iteration per child-run.
Equal to 1, the default.
- max_concurrent_iterations
- int
Represents the maximum number of iterations that would be executed in parallel. The default value is 1.
AmlCompute clusters support one iteration running per node. For multiple AutoML experiment parent runs executed in parallel on a single AmlCompute cluster, the sum of the max_concurrent_iterations values for all experiments should be less than or equal to the maximum number of nodes. Otherwise, runs will be queued until nodes are available.
DSVM supports multiple iterations per node. max_concurrent_iterations should be less than or equal to the number of cores on the DSVM. For multiple experiments run in parallel on a single DSVM, the sum of the max_concurrent_iterations values for all experiments should be less than or equal to the maximum number of nodes.
Databricks: max_concurrent_iterations should be less than or equal to the number of worker nodes on Databricks.
max_concurrent_iterations does not apply to local runs. Formerly, this parameter was named concurrent_iterations.
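The AmlCompute rule above amounts to a simple budget check. The sketch below assumes you already know each experiment's max_concurrent_iterations and the cluster's maximum node count; amlcompute_runs_will_queue is a hypothetical helper, not an SDK function.

```python
def amlcompute_runs_will_queue(max_concurrent_per_experiment, max_nodes):
    """True if the summed max_concurrent_iterations exceeds the node count."""
    return sum(max_concurrent_per_experiment) > max_nodes


# Two experiments sharing a 4-node cluster.
print(amlcompute_runs_will_queue([2, 2], max_nodes=4))
print(amlcompute_runs_will_queue([3, 2], max_nodes=4))
```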
- iteration_timeout_minutes
- int
Maximum time in minutes that each iteration can run for before it terminates. If not specified, a value of 1 month or 43200 minutes is used.
- mem_in_mb
- int
Maximum memory that each iteration can use before it terminates. If not specified, a value of 1 PB or 1073741824 MB is used.
- enforce_time_on_windows
- bool
Whether to enforce a time limit on model training at each iteration on Windows. The default is True. If running from a Python script file (.py), see the documentation for allowing resource limits on Windows.
- experiment_timeout_hours
- float
Maximum amount of time in hours that all iterations combined can take before the experiment terminates. Can be a decimal value like 0.25 representing 15 minutes. If not specified, the default experiment timeout is 6 days. To specify a timeout less than or equal to 1 hour, make sure your dataset's size is not greater than 10,000,000 (rows times columns) or an error results.
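A small sketch of the arithmetic behind this parameter: converting fractional hours to minutes and checking the dataset-size condition for timeouts of one hour or less. check_experiment_timeout is a hypothetical helper written for illustration; the actual validation is performed by the service.

```python
def check_experiment_timeout(hours, n_rows, n_columns, size_limit=10_000_000):
    """Return (timeout in minutes, whether the size rule described above is met)."""
    minutes = hours * 60
    cells = n_rows * n_columns
    # The size limit only matters for timeouts of one hour or less.
    allowed = hours > 1 or cells <= size_limit
    return minutes, allowed


print(check_experiment_timeout(0.25, n_rows=100_000, n_columns=20))
print(check_experiment_timeout(0.5, n_rows=2_000_000, n_columns=10))
```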
- experiment_exit_score
- float
Target score for experiment. The experiment terminates after this score is reached. If not specified (no criteria), the experiment runs until no further progress is made on the primary metric. For more information on exit criteria, see this article.
- enable_early_stopping
- bool
Whether to enable early termination if the score is not improving in the short term. The default is True.
Early stopping logic:
- No early stopping for first 20 iterations (landmarks).
- Early stopping window starts on the 21st iteration and looks for early_stopping_n_iters iterations (currently set to 10). This means that the first iteration where stopping can occur is the 31st.
- AutoML still schedules 2 ensemble iterations AFTER early stopping, which might result in higher scores.
- Early stopping is triggered if the absolute value of best score calculated is the same for past early_stopping_n_iters iterations, that is, if there is no improvement in score for early_stopping_n_iters iterations.
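The early stopping rules can be sketched as a small simulation. This is one reading of the description, not the service's implementation: it assumes higher scores are better, ignores the two trailing ensemble iterations, and first_stopping_iteration is a hypothetical helper.

```python
def first_stopping_iteration(scores, landmarks=20, n_iters=10):
    """Return the 1-based iteration at which stopping could occur, or None.

    Stopping at iteration i requires that iterations i - n_iters .. i - 1
    did not improve on the best score seen before that window.
    """
    for i in range(landmarks + n_iters + 1, len(scores) + 2):
        window_start = i - 1 - n_iters  # 0-based index of the window's first item
        best_before = max(scores[:window_start])
        best_in_window = max(scores[window_start:i - 1])
        if best_in_window <= best_before:
            return i
    return None


# 20 landmark iterations that keep improving, then 10 with no improvement.
flat_after_landmarks = list(range(20)) + [5] * 10
print(first_stopping_iteration(flat_after_landmarks))

# An improvement at iteration 25 pushes the earliest stop back.
late_improvement = [1] * 24 + [9] + [1] * 15
print(first_stopping_iteration(late_improvement))
```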
- blocked_models
- list(str) or list(Classification) <xref:for classification task> or list(Regression) <xref:for regression task> or list(Forecasting) <xref:for forecasting task>
A list of algorithms to ignore for an experiment. If enable_tf is False, TensorFlow models are included in blocked_models.
- blacklist_models
- list(str) or list(Classification) <xref:for classification task> or list(Regression) <xref:for regression task> or list(Forecasting) <xref:for forecasting task>
Deprecated parameter, use blocked_models instead.
- exclude_nan_labels
- bool
Whether to exclude rows with NaN values in the label. The default is True.
- verbosity
- int
The verbosity level for writing to the log file. The default is INFO or 20. Acceptable values are defined in the Python logging library.
- enable_tf
- bool
Deprecated parameter to enable/disable Tensorflow algorithms. The default is False.
- model_explainability
- bool
Whether to enable explaining the best AutoML model at the end of all AutoML training iterations. The default is True. For more information, see Interpretability: model explanations in automated machine learning.
- allowed_models
- list(str) or list(Classification) <xref:for classification task> or list(Regression) <xref:for regression task> or list(Forecasting) <xref:for forecasting task>
A list of model names to search for an experiment. If not specified, then all models supported for the task are used minus any specified in blocked_models or deprecated TensorFlow models. The supported models for each task type are described in the SupportedModels class.
- whitelist_models
- list(str) or list(Classification) <xref:for classification task> or list(Regression) <xref:for regression task> or list(Forecasting) <xref:for forecasting task>
Deprecated parameter, use allowed_models instead.
- enable_onnx_compatible_models
- bool
Whether to enable or disable enforcing the ONNX-compatible models. The default is False. For more information about Open Neural Network Exchange (ONNX) and Azure Machine Learning, see this article.
- forecasting_parameters
- ForecastingParameters
A ForecastingParameters object to hold all the forecasting specific parameters.
- time_column_name
- str
The name of the time column. This parameter is required when forecasting to specify the datetime column in the input data used for building the time series and inferring its frequency. This setting is being deprecated. Please use forecasting_parameters instead.
- max_horizon
- int
The desired maximum forecast horizon in units of time-series frequency. The default value is 1.
Units are based on the time interval of your training data, e.g., monthly, weekly that the forecaster should predict out. When task type is forecasting, this parameter is required. For more information on setting forecasting parameters, see Auto-train a time-series forecast model. This setting is being deprecated. Please use forecasting_parameters instead.
- grain_column_names
- str or list(str)
The names of columns used to group a timeseries. It can be used to create multiple series. If grain is not defined, the data set is assumed to be one time-series. This parameter is used with task type forecasting. This setting is being deprecated. Please use forecasting_parameters instead.
- target_lags
- int or list(int)
The number of past periods to lag from the target column. The default is 1. This setting is being deprecated. Please use forecasting_parameters instead.
When forecasting, this parameter represents the number of rows to lag the target values based on the frequency of the data. This is represented as a list or single integer. Lag should be used when the relationship between the independent variables and the dependent variable does not match up or correlate by default. For example, when trying to forecast demand for a product, the demand in any month may depend on the price of specific commodities 3 months prior. In this example, you may want to lag the target (demand) negatively by 3 months so that the model is training on the correct relationship. For more information, see Auto-train a time-series forecast model.
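As a rough illustration of what lagging by a number of rows means, the sketch below shifts a monthly series so that each row sees the value from n periods earlier. The lag helper and the demand values are made up for the example and do not reflect how AutoML builds its lag features internally.

```python
def lag(values, periods):
    """Shift a series forward by `periods` rows, padding the start with None."""
    periods = min(periods, len(values))
    return [None] * periods + values[: len(values) - periods]


demand = [100, 110, 120, 130, 140]  # illustrative monthly values
print(lag(demand, 1))
print(lag(demand, 3))
```

The first `periods` rows have no history to draw on, which is why they are filled with None here.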
- feature_lags
- str
Flag for generating lags for the numeric features. This setting is being deprecated. Please use forecasting_parameters instead.
- target_rolling_window_size
- int
The number of past periods used to create a rolling window average of the target column. This setting is being deprecated. Please use forecasting_parameters instead.
When forecasting, this parameter represents n historical periods to use to generate forecasted values, <= training set size. If omitted, n is the full training set size. Specify this parameter when you only want to consider a certain amount of history when training the model.
- country_or_region
- str
The country/region used to generate holiday features. These should be ISO 3166 two-letter country/region code, for example 'US' or 'GB'. This setting is being deprecated. Please use forecasting_parameters instead.
- use_stl
- str
Configure STL Decomposition of the time-series target column. use_stl can take three values: None (default) - no STL decomposition, 'season' - only generate the season component, and 'season_trend' - generate both season and trend components. This setting is being deprecated. Please use forecasting_parameters instead.
- seasonality
- int or str
Set time series seasonality. If seasonality is set to 'auto', it will be inferred. This setting is being deprecated. Please use forecasting_parameters instead.
- short_series_handling_configuration
- str
The parameter defining how AutoML should handle short time series.
Possible values: 'auto' (default), 'pad', 'drop' and None.
- auto: short series will be padded if there are no long series, otherwise short series will be dropped.
- pad: all the short series will be padded.
- drop: all the short series will be dropped.
- None: the short series will not be modified.
If set to 'pad', the table will be padded with zeroes and empty values for the regressors and with random values for the target, with the mean equal to the target value median for the given time series ID. If the median is greater than or equal to zero, the minimal padded value will be clipped at zero.
Input:
| Date | numeric_value | string | target |
|---|---|---|---|
| 2020-01-01 | 23 | green | 55 |
Output, assuming the minimal number of values is four:
| Date | numeric_value | string | target |
|---|---|---|---|
| 2019-12-29 | 0 | NA | 55.1 |
| 2019-12-30 | 0 | NA | 55.6 |
| 2019-12-31 | 0 | NA | 54.5 |
| 2020-01-01 | 23 | green | 55 |
Note: There are two parameters, short_series_handling_configuration and the legacy short_series_handling. When both parameters are set, they are synchronized as shown in the table below (for brevity, short_series_handling_configuration and short_series_handling are marked as handling_configuration and handling respectively).
| handling | handling_configuration | resulting handling | resulting handling_configuration |
|---|---|---|---|
| True | auto | True | auto |
| True | pad | True | auto |
| True | drop | True | auto |
| True | None | False | None |
| False | auto | False | None |
| False | pad | False | None |
| False | drop | False | None |
| False | None | False | None |
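The synchronization table reduces to one rule: the pair resolves to (True, 'auto') only when handling is True and handling_configuration is not None, and to (False, None) in every other case. A minimal sketch of that rule follows; sync_short_series_settings is a hypothetical helper derived from the table, not an SDK function.

```python
def sync_short_series_settings(handling, handling_configuration):
    """Return (resulting handling, resulting handling_configuration)."""
    if handling and handling_configuration is not None:
        return True, "auto"
    return False, None


for handling in (True, False):
    for configuration in ("auto", "pad", "drop", None):
        print(handling, configuration, sync_short_series_settings(handling, configuration))
```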
- freq
- str or None
Forecast frequency.
When forecasting, this parameter represents the period with which the forecast is desired, for example daily, weekly, yearly, etc. The forecast frequency is the dataset frequency by default. You can optionally set it to greater (but not lesser) than the dataset frequency. The data is then aggregated and the results are generated at the forecast frequency. For example, for daily data, you can set the frequency to be daily, weekly or monthly, but not hourly. The frequency needs to be a pandas offset alias. Please refer to pandas documentation for more information: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects
- target_aggregation_function
- str or None
The function to be used to aggregate the time series target column to conform to a user specified frequency. If target_aggregation_function is set but the freq parameter is not set, an error is raised. The possible target aggregation functions are: "sum", "max", "min" and "mean".
| freq | target_aggregation_function | Data regularity fixing mechanism |
|---|---|---|
| None (Default) | None (Default) | The aggregation is not applied. If the valid frequency cannot be determined, the error will be raised. |
| Some Value | None (Default) | The aggregation is not applied. If the number of data points compliant to the given frequency grid is less than 90%, these points will be removed; otherwise the error will be raised. |
| None (Default) | Aggregation function | The error about the missing frequency parameter is raised. |
| Some Value | Aggregation function | Aggregate to frequency using the provided aggregation function. |
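The interaction between freq and target_aggregation_function can be sketched as below. This is an illustration under simplifying assumptions: the target values are already bucketed by forecast period (here, hypothetical ISO-week labels), aggregate_target is a made-up helper, and the 90% regularity check from the table is not modeled.

```python
from statistics import mean

# The four aggregation function names listed in the description above.
AGGREGATORS = {"sum": sum, "max": max, "min": min, "mean": mean}


def aggregate_target(values_by_period, freq, target_aggregation_function):
    """Aggregate pre-bucketed target values, mirroring the table's four cases."""
    if target_aggregation_function is None:
        return values_by_period  # aggregation is not applied
    if freq is None:
        raise ValueError("freq must be set when target_aggregation_function is set")
    aggregate = AGGREGATORS[target_aggregation_function]
    return {period: aggregate(values) for period, values in values_by_period.items()}


daily_by_week = {"2020-W01": [1, 2, 3], "2020-W02": [4, 5]}
weekly_sum = aggregate_target(daily_by_week, freq="W", target_aggregation_function="sum")
print(weekly_sum)
```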
- enable_voting_ensemble
- bool
Whether to enable/disable VotingEnsemble iteration. The default is True. For more information about ensembles, see Ensemble configuration.
- enable_stack_ensemble
- bool
Whether to enable/disable StackEnsemble iteration. The default is None. If enable_onnx_compatible_models flag is being set, then StackEnsemble iteration will be disabled. Similarly, for Timeseries tasks, StackEnsemble iteration will be disabled by default, to avoid risks of overfitting due to small training set used in fitting the meta learner. For more information about ensembles, see Ensemble configuration.
- debug_log
- str
The log file to write debug information to. If not specified, 'automl.log' is used.
- training_data
- DataFrame or Dataset or DatasetDefinition or TabularDataset
The training data to be used within the experiment.
It should contain both training features and a label column (optionally a sample weights column).
If training_data is specified, then the label_column_name parameter must also be specified.
training_data was introduced in version 1.0.81.
- validation_data
- DataFrame or Dataset or DatasetDefinition or TabularDataset
The validation data to be used within the experiment.
It should contain both training features and label column (optionally a sample weights column).
If validation_data is specified, then the training_data and label_column_name parameters must be specified.
validation_data was introduced in version 1.0.81. For more information, see Configure data splits and cross-validation in automated machine learning.
- test_data
- Dataset or TabularDataset
The Model Test feature using test datasets or test data splits is a feature in Preview state and might change at any time. The test data to be used for a test run that will automatically be started after model training is complete. The test run will get predictions using the best model and will compute metrics given these predictions.
If neither this parameter nor the test_size parameter is specified, no test run is executed automatically after model training is completed.
Test data should contain both features and label column.
If test_data is specified, then the label_column_name parameter must be specified.
- test_size
- float
The Model Test feature using test datasets or test data splits is a feature in Preview state and might change at any time. What fraction of the training data to hold out for test data for a test run that will automatically be started after model training is complete. The test run will get predictions using the best model and will compute metrics given these predictions.
This should be between 0.0 and 1.0 non-inclusive.
If test_size is specified at the same time as validation_size, then the test data is split from training_data before the validation data is split. For example, if validation_size=0.1, test_size=0.1 and the original training data has 1000 rows, then the test data will have 100 rows, the validation data will contain 90 rows and the training data will have 810 rows.
For regression based tasks, random sampling is used. For classification tasks, stratified sampling is used. Forecasting does not currently support specifying a test dataset using a train/test split.
If neither this parameter nor the test_data parameter is specified, no test run is executed automatically after model training is completed.
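The row counts in the example follow from taking the test split first and the validation split from what remains. A minimal sketch of that arithmetic is shown below; split_sizes is a hypothetical helper, and rounding to the nearest row is an assumption, since the description does not state how fractional row counts are handled.

```python
def split_sizes(n_rows, test_size=0.0, validation_size=0.0):
    """Return (train, validation, test) row counts; test is split off first."""
    n_test = round(n_rows * test_size)
    remaining = n_rows - n_test
    n_valid = round(remaining * validation_size)
    n_train = remaining - n_valid
    return n_train, n_valid, n_test


print(split_sizes(1000, test_size=0.1, validation_size=0.1))
```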
- label_column_name
- int or str
The name of the label column. If the input data is from a pandas.DataFrame which doesn't have column names, column indices can be used instead, expressed as integers.
This parameter is applicable to the training_data, validation_data and test_data parameters.
label_column_name was introduced in version 1.0.81.
- weight_column_name
- int or str
The name of the sample weight column. Automated ML supports a weighted column as an input, causing rows in the data to be weighted up or down. If the input data is from a pandas.DataFrame which doesn't have column names, column indices can be used instead, expressed as integers.
This parameter is applicable to the training_data and validation_data parameters.
weight_column_name was introduced in version 1.0.81.
- cv_split_column_names
- list(str)
List of names of the columns that contain custom cross validation splits. Each of the CV split columns represents one CV split, where each row is marked either 1 for training or 0 for validation.
This parameter is applicable to the training_data parameter for custom cross validation purposes.
cv_split_column_names was introduced in version 1.6.0.
Use either cv_split_column_names or cv_splits_indices.
For more information, see Configure data splits and cross-validation in automated machine learning.
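The 1/0 marker columns can be sketched with plain Python rows. The column names cv1 to cv3 and the round-robin fold assignment are illustrative choices, not requirements; in practice the columns would live in the training dataset and their names would be passed as cv_split_column_names.

```python
# Six illustrative rows with one feature and one label column.
rows = [{"feature": i, "label": i % 2} for i in range(6)]

n_folds = 3
cv_columns = [f"cv{k + 1}" for k in range(n_folds)]  # hypothetical column names

for idx, row in enumerate(rows):
    for k, name in enumerate(cv_columns):
        # 0 marks the row as validation for this split, 1 marks it as training.
        row[name] = 0 if idx % n_folds == k else 1

for row in rows:
    print(row)
```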
- enable_local_managed
- bool
Disabled parameter. Local managed runs can not be enabled at this time.
- enable_dnn
- bool
Whether to include DNN based models during model selection. The default in the init is None. However, the default is True for DNN NLP tasks, and it's False for all other AutoML tasks.
- task
- str
The type of task to run. Values can be 'classification', 'regression', or 'forecasting' depending on the type of automated ML problem to solve.
- path
- str
The full path to the Azure Machine Learning project folder. If not specified, the default is to use the current directory or ".".
- iterations
- int
The total number of different algorithm and parameter combinations to test during an automated ML experiment. If not specified, the default is 1000 iterations.
- primary_metric
- str
The metric that Automated Machine Learning will optimize for model selection. Automated Machine Learning collects more metrics than it can optimize. You can use get_primary_metrics to get a list of valid metrics for your given task. For more information on how metrics are calculated, see https://docs.microsoft.com/azure/machine-learning/how-to-configure-auto-train#primary-metric.
If not specified, accuracy is used for classification tasks, normalized root mean squared error is used for forecasting and regression tasks, accuracy is used for image classification and image multi-label classification, and mean average precision is used for image object detection.
- positive_label
- Any
The positive class label that Automated Machine Learning will use to calculate binary metrics. Binary metrics are calculated under two conditions for classification tasks:
- The label column consists of two classes, indicating a binary classification task. AutoML uses the specified positive class when positive_label is passed in; otherwise, AutoML picks a positive class based on the label-encoded value.
- A multiclass classification task with positive_label specified.
For more information on classification, check out metrics for classification scenarios.
- compute_target
- AbstractComputeTarget
The Azure Machine Learning compute target to run the Automated Machine Learning experiment on. See https://docs.microsoft.com/azure/machine-learning/how-to-auto-train-remote for more information on compute targets.
- spark_context
- <xref:SparkContext>
The Spark context. Only applicable when used inside Azure Databricks/Spark environment.
- X
- DataFrame or ndarray or Dataset or DatasetDefinition or TabularDataset
The training features to use when fitting pipelines during an experiment. This setting is being deprecated. Please use training_data and label_column_name instead.
- y
- DataFrame or ndarray or Dataset or DatasetDefinition or TabularDataset
The training labels to use when fitting pipelines during an experiment. This is the value your model will predict. This setting is being deprecated. Please use training_data and label_column_name instead.
- sample_weight
- DataFrame or ndarray or TabularDataset
The weight to give to each training sample when running fitting pipelines. Each row should correspond to a row in X and y data.
Specify this parameter when specifying X.
This setting is being deprecated. Please use training_data and weight_column_name instead.
- X_valid
- DataFrame or ndarray or Dataset or DatasetDefinition or TabularDataset
Validation features to use when fitting pipelines during an experiment.
If specified, then y_valid or sample_weight_valid must also be specified.
This setting is being deprecated. Please use validation_data and label_column_name instead.
- y_valid
- DataFrame or ndarray or Dataset or DatasetDefinition or TabularDataset
Validation labels to use when fitting pipelines during an experiment.
Both X_valid and y_valid must be specified together.
This setting is being deprecated. Please use validation_data and label_column_name instead.
- sample_weight_valid
- DataFrame or ndarray or TabularDataset
The weight to give to each validation sample when running scoring pipelines. Each row should correspond to a row in X and y data.
Specify this parameter when specifying X_valid.
This setting is being deprecated. Please use validation_data and weight_column_name instead.
- cv_splits_indices
- List[List[ndarray]]
Indices where to split training data for cross validation. Each row is a separate cross fold. Within each cross fold, provide 2 numpy arrays: the first with the indices of the samples to use for training data and the second with the indices to use for validation data. That is, [[t1, v1], [t2, v2], ...] where t1 is the training indices for the first cross fold and v1 is the validation indices for the first cross fold. This option is supported when data is passed as a separate Features dataset and Label column.
To specify existing data as validation data, use validation_data. To let AutoML extract validation data out of training data instead, specify either n_cross_validations or validation_size.
Use cv_split_column_names if you have cross validation column(s) in training_data.
- validation_size
- float
What fraction of the data to hold out for validation when user validation data is not specified. This should be between 0.0 and 1.0 non-inclusive.
Specify validation_data to provide validation data, otherwise set n_cross_validations or validation_size to extract validation data out of the specified training data.
For custom cross validation fold, use cv_split_column_names.
For more information, see Configure data splits and cross-validation in automated machine learning.
- n_cross_validations
- int
How many cross validations to perform when user validation data is not specified.
Specify validation_data to provide validation data, otherwise set n_cross_validations or validation_size to extract validation data out of the specified training data.
For custom cross validation fold, use cv_split_column_names.
For more information, see Configure data splits and cross-validation in automated machine learning.
- y_min
- float
Minimum value of y for a regression experiment. The combination of y_min and y_max is used to normalize test set metrics based on the input data range. This setting is being deprecated. Instead, this value will be computed from the data.
- y_max
- float
Maximum value of y for a regression experiment. The combination of y_min and y_max is used to normalize test set metrics based on the input data range. This setting is being deprecated. Instead, this value will be computed from the data.
- num_classes
- int
The number of classes in the label data for a classification experiment. This setting is being deprecated. Instead, this value will be computed from the data.
- featurization
- str or FeaturizationConfig
'auto' / 'off' / FeaturizationConfig Indicator for whether featurization step should be done automatically or not, or whether customized featurization should be used. Note: If the input data is sparse, featurization cannot be turned on.
Column type is automatically detected. Based on the detected column type preprocessing/featurization is done as follows:
Categorical: Target encoding, one hot encoding, drop high cardinality categories, impute missing values.
Numeric: Impute missing values, cluster distance, weight of evidence.
DateTime: Several features such as day, seconds, minutes, hours etc.
Text: Bag of words, pre-trained Word embedding, text target encoding.
More details can be found in the article Configure automated ML experiments in Python.
To customize featurization step, provide a FeaturizationConfig object. Customized featurization currently supports blocking a set of transformers, updating column purpose, editing transformer parameters, and dropping columns. For more information, see Customize feature engineering.
Note: Timeseries features are handled separately when the task type is set to forecasting independent of this parameter.
- max_cores_per_iteration
- int
The maximum number of threads to use for a given training iteration. Acceptable values:
Greater than 1 and less than or equal to the maximum number of cores on the compute target.
Equal to -1, which means to use all the possible cores per iteration per child-run.
Equal to 1, the default value.
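The acceptable values above can be expressed as a small check. This is only a sketch: the function name is hypothetical, and cores_on_target stands for the core count of whatever compute target is used.

```python
def is_valid_max_cores(value: int, cores_on_target: int) -> bool:
    """Check max_cores_per_iteration against the acceptable values listed above."""
    if value == -1:
        # -1 means "use all the possible cores per iteration per child-run".
        return True
    # 1 is the default; larger values must not exceed the cores on the target.
    return 1 <= value <= cores_on_target


for candidate in (-1, 1, 4, 0, 16):
    print(candidate, is_valid_max_cores(candidate, cores_on_target=8))
```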
- max_concurrent_iterations
- int
Represents the maximum number of iterations that would be executed in parallel. The default value is 1.
AmlCompute clusters support one iteration running per node. For multiple experiments run in parallel on a single AmlCompute cluster, the sum of the max_concurrent_iterations values for all experiments should be less than or equal to the maximum number of nodes.
DSVM supports multiple iterations per node. max_concurrent_iterations should be less than or equal to the number of cores on the DSVM. For multiple experiments run in parallel on a single DSVM, the sum of the max_concurrent_iterations values for all experiments should be less than or equal to the maximum number of nodes.
Databricks: max_concurrent_iterations should be less than or equal to the number of worker nodes on Databricks.
max_concurrent_iterations does not apply to local runs. Formerly, this parameter was named concurrent_iterations.
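For the AmlCompute case, the rule about experiments sharing a cluster amounts to a simple sum. The sketch below uses a hypothetical helper name and made-up numbers.

```python
from typing import Sequence


def fits_on_cluster(max_concurrent_iterations: Sequence[int], max_nodes: int) -> bool:
    """True when the summed max_concurrent_iterations does not exceed the node count."""
    return sum(max_concurrent_iterations) <= max_nodes


# Two experiments sharing a 6-node cluster.
print(fits_on_cluster([4, 2], max_nodes=6))
print(fits_on_cluster([4, 4], max_nodes=6))
```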
- iteration_timeout_minutes
- int
Maximum time in minutes that each iteration can run for before it terminates. If not specified, a value of 1 month or 43200 minutes is used.
- mem_in_mb
- int
Maximum memory, in MB, that each iteration can use before it terminates. If not specified, a value of 1 PB or 1073741824 MB is used.
- enforce_time_on_windows
- bool
Whether to enforce a time limit on model training at each iteration on Windows. The default is True. If running from a Python script file (.py), see the documentation for allowing resource limits on Windows.
- experiment_timeout_hours
- float
Maximum amount of time in hours that all iterations combined can take before the experiment terminates. Can be a decimal value like 0.25 representing 15 minutes. If not specified, the default experiment timeout is 6 days. To specify a timeout less than or equal to 1 hour, make sure your dataset's size is not greater than 10,000,000 (rows times columns) or an error results.
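The timeout arithmetic and the dataset-size condition can be sketched as follows. The function names and the error message are hypothetical; only the 10,000,000 limit and the 0.25 hour example come from the description above.

```python
MAX_CELLS_FOR_SHORT_TIMEOUT = 10_000_000  # rows times columns, from the text above


def timeout_in_minutes(experiment_timeout_hours: float) -> float:
    """Convert the hour-based setting to minutes."""
    return experiment_timeout_hours * 60


def check_experiment_timeout(experiment_timeout_hours: float, n_rows: int, n_cols: int) -> None:
    """Raise if a timeout of at most 1 hour is combined with a dataset that is too large."""
    too_large = n_rows * n_cols > MAX_CELLS_FOR_SHORT_TIMEOUT
    if experiment_timeout_hours <= 1 and too_large:
        raise ValueError("dataset too large for a timeout of 1 hour or less")


print(timeout_in_minutes(0.25))
check_experiment_timeout(1.0, n_rows=1_000_000, n_cols=10)  # exactly at the limit, accepted
```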
- experiment_exit_score
- float
Target score for experiment. The experiment terminates after this score is reached. If not specified (no criteria), the experiment runs until no further progress is made on the primary metric. For more information on exit criteria, see this article: /azure/machine-learning/how-to-configure-auto-train#exit-criteria.
- enable_early_stopping
- bool
Whether to enable early termination if the score is not improving in the short term. The default is True.
Early stopping logic:
No early stopping for first 20 iterations (landmarks).
Early stopping window starts on the 21st iteration and looks for early_stopping_n_iters iterations (currently set to 10). This means that the first iteration where stopping can occur is the 31st.
AutoML still schedules 2 ensemble iterations AFTER early stopping, which might result in higher scores.
Early stopping is triggered if the absolute value of best score calculated is the same for past early_stopping_n_iters iterations, that is, if there is no improvement in score for early_stopping_n_iters iterations.
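One way to read the early stopping rules above is sketched below. It is an interpretation, not the service's implementation: it assumes a higher score is better, treats "no improvement" as the running best being unchanged over the last early_stopping_n_iters iterations, and ignores the two ensemble iterations scheduled afterwards. The function name is hypothetical.

```python
from typing import Optional, Sequence


def first_early_stop_iteration(
    scores: Sequence[float],
    landmarks: int = 20,
    early_stopping_n_iters: int = 10,
) -> Optional[int]:
    """Return the 1-based iteration at which stopping would trigger, or None."""
    # best_after[k] is the best score seen in iterations 1..k+1.
    best_after = []
    best = float("-inf")
    for score in scores:
        best = max(best, score)
        best_after.append(best)

    # With the defaults, the first iteration where stopping can occur is the 31st.
    first_allowed = landmarks + early_stopping_n_iters + 1
    for iteration in range(first_allowed, len(scores) + 1):
        unchanged = best_after[iteration - 1] == best_after[iteration - 1 - early_stopping_n_iters]
        if unchanged:
            return iteration
    return None


flat = [1] * 50
improving_then_flat = list(range(1, 26)) + [25] * 25
print(first_early_stop_iteration(flat))
print(first_early_stop_iteration(improving_then_flat))
```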
- blocked_models
- list(str) or list(Classification) for classification task, or list(Regression) for regression task, or list(Forecasting) for forecasting task
A list of algorithms to ignore for an experiment. If enable_tf is False, TensorFlow models are included in blocked_models.
- blacklist_models
- list(str) or list(Classification) for classification task, or list(Regression) for regression task, or list(Forecasting) for forecasting task
Deprecated parameter, use blocked_models instead.
- exclude_nan_labels
- bool
Whether to exclude rows with NaN values in the label. The default is True.
- verbosity
- int
The verbosity level for writing to the log file. The default is INFO or 20. Acceptable values are defined in the Python logging library.
- model_explainability
- bool
Whether to enable explaining the best AutoML model at the end of all AutoML training iterations. The default is True. For more information, see Interpretability: model explanations in automated machine learning.
- allowed_models
- list(str) or list(Classification) for classification task, or list(Regression) for regression task, or list(Forecasting) for forecasting task
A list of model names to search for an experiment. If not specified, then all models supported for the task are used minus any specified in blocked_models or deprecated TensorFlow models. The supported models for each task type are described in the SupportedModels class.
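The interplay of allowed_models and blocked_models can be sketched as a set difference. The helper name is hypothetical, the model names are only illustrative, and subtracting blocked_models when allowed_models is also given is an assumption rather than documented behavior.

```python
from typing import Iterable, List, Optional


def models_to_search(
    supported: Iterable[str],
    allowed_models: Optional[Iterable[str]] = None,
    blocked_models: Optional[Iterable[str]] = None,
) -> List[str]:
    """Start from allowed_models (or all supported models) and drop blocked ones."""
    candidates = list(supported) if allowed_models is None else list(allowed_models)
    blocked = set(blocked_models or [])
    return [name for name in candidates if name not in blocked]


supported = ["ElasticNet", "LightGBM", "RandomForest", "XGBoostRegressor"]
print(models_to_search(supported, blocked_models=["XGBoostRegressor"]))
print(models_to_search(supported, allowed_models=["LightGBM", "ElasticNet"]))
```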
- whitelist_models
Deprecated parameter, use allowed_models instead.
- enable_onnx_compatible_models
- bool
Whether to enable or disable enforcing the ONNX-compatible models. The default is False. For more information about Open Neural Network Exchange (ONNX) and Azure Machine Learning, see this article.
- forecasting_parameters
- ForecastingParameters
An object to hold all the forecasting specific parameters.
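Several of the forecasting settings that follow are marked as deprecated in favor of forecasting_parameters. As a hedged sketch, those settings could be collected as shown here. The column names and all values are hypothetical. The import path and the keyword names forecast_horizon and time_series_id_column_names (assumed replacements for max_horizon and grain_column_names) are assumptions that should be checked against the ForecastingParameters reference.

```python
# Keyword arguments intended for ForecastingParameters (all values are hypothetical).
forecasting_kwargs = {
    "time_column_name": "date",
    "forecast_horizon": 14,
    "time_series_id_column_names": ["store_id"],
    "target_lags": 3,
    "target_rolling_window_size": 7,
}

try:
    # Assumed import path; requires the Azure ML SDK to be installed.
    from azureml.automl.core.forecasting_parameters import ForecastingParameters
except ImportError:
    # SDK not available: keep only the plain dictionary.
    ForecastingParameters = None

forecasting_parameters = (
    ForecastingParameters(**forecasting_kwargs) if ForecastingParameters else None
)
# The object would then be passed as
# AutoMLConfig(task="forecasting", forecasting_parameters=forecasting_parameters, ...)
```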
- time_column_name
- str
The name of the time column. This parameter is required when forecasting to specify the datetime column in the input data used for building the time series and inferring its frequency. This setting is being deprecated. Please use forecasting_parameters instead.
- max_horizon
- int
The desired maximum forecast horizon in units of time-series frequency. The default value is 1. This setting is being deprecated. Please use forecasting_parameters instead.
Units are based on the time interval of your training data (for example, monthly or weekly) that the forecaster should predict out. When task type is forecasting, this parameter is required. For more information on setting forecasting parameters, see Auto-train a time-series forecast model.
- grain_column_names
- str or list(str)
The names of columns used to group a timeseries. It can be used to create multiple series. If grain is not defined, the data set is assumed to be one time-series. This parameter is used with task type forecasting. This setting is being deprecated. Please use forecasting_parameters instead.
- target_lags
- int or list(int)
The number of past periods to lag from the target column. The default is 1. This setting is being deprecated. Please use forecasting_parameters instead.
When forecasting, this parameter represents the number of rows to lag the target values based on the frequency of the data. This is represented as a list or single integer. Lag should be used when the relationship between the independent variables and the dependent variable does not match up or correlate by default. For example, when trying to forecast demand for a product, the demand in any month may depend on the price of specific commodities 3 months prior. In this example, you may want to lag the target (demand) negatively by 3 months so that the model is training on the correct relationship. For more information, see Auto-train a time-series forecast model.
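To make the idea of lagging concrete, the sketch below shifts a monthly series by 3 periods so that each row carries the value from 3 rows earlier. It only illustrates the shift itself; the helper name and the numbers are hypothetical, and how AutoML builds its lag features internally is not shown here.

```python
from typing import List, Optional, Sequence


def lag(values: Sequence[float], periods: int) -> List[Optional[float]]:
    """Shift values down by `periods` rows, filling the first rows with None."""
    if periods < 1:
        raise ValueError("periods must be at least 1")
    shifted = [None] * periods + list(values)
    # Truncate so the lagged column has the same length as the input.
    return shifted[: len(values)]


monthly_demand = [10, 20, 30, 40, 50]
print(lag(monthly_demand, 3))
```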
- feature_lags
- str
Flag for generating lags for the numeric features. This setting is being deprecated. Please use forecasting_parameters instead.
- target_rolling_window_size
- int
The number of past periods used to create a rolling window average of the target column. This setting is being deprecated. Please use forecasting_parameters instead.
When forecasting, this parameter represents n historical periods to use to generate forecasted values, <= training set size. If omitted, n is the full training set size. Specify this parameter when you only want to consider a certain amount of history when training the model.
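A rolling window average over past periods can be sketched as follows. This assumes the window covers the n rows before the current one and yields None until a full window is available; both choices are assumptions for illustration, and the helper name is hypothetical.

```python
from typing import List, Optional, Sequence


def rolling_mean_of_past(values: Sequence[float], window: int) -> List[Optional[float]]:
    """Average of the `window` values preceding each row (None until enough history)."""
    if window < 1:
        raise ValueError("window must be at least 1")
    result = []
    for i in range(len(values)):
        past = values[max(0, i - window):i]
        result.append(sum(past) / window if len(past) == window else None)
    return result


print(rolling_mean_of_past([1, 2, 3, 4, 5], window=2))
```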
- country_or_region
- str
The country/region used to generate holiday features. These should be ISO 3166 two-letter country/region codes, for example 'US' or 'GB'. This setting is being deprecated. Please use forecasting_parameters instead.
- use_stl
- str
Configure STL Decomposition of the time-series target column. use_stl can take three values: None (default): no STL decomposition; 'season': only generate the season component; 'season_trend': generate both season and trend components. This setting is being deprecated. Please use forecasting_parameters instead.
- seasonality
- int
Set time series seasonality. If seasonality is set to -1, it will be inferred. If use_stl is not set, this parameter will not be used. This setting is being deprecated. Please use forecasting_parameters instead.
- short_series_handling_configuration
- str
The parameter defining how AutoML should handle short time series.
Possible values: 'auto' (default), 'pad', 'drop' and None.
- auto: short series will be padded if there are no long series, otherwise short series will be dropped.
- pad: all the short series will be padded.
- drop: all the short series will be dropped.
- None: the short series will not be modified.
If set to 'pad', the table will be padded with zeroes and empty values for the regressors and random values for the target, with the mean equal to the target value median for the given time series id. If the median is greater than or equal to zero, the minimal padded value will be clipped at zero.
Input:
+------------+---------------+--------+--------+
| Date       | numeric_value | string | target |
+============+===============+========+========+
| 2020-01-01 | 23            | green  | 55     |
+------------+---------------+--------+--------+
Output, assuming the minimal number of values is four:
+------------+---------------+--------+--------+
| Date       | numeric_value | string | target |
+============+===============+========+========+
| 2019-12-29 | 0             | NA     | 55.1   |
+------------+---------------+--------+--------+
| 2019-12-30 | 0             | NA     | 55.6   |
+------------+---------------+--------+--------+
| 2019-12-31 | 0             | NA     | 54.5   |
+------------+---------------+--------+--------+
| 2020-01-01 | 23            | green  | 55     |
+------------+---------------+--------+--------+
Note: There are two parameters, short_series_handling_configuration and the legacy short_series_handling. When both parameters are set, they are synchronized as shown in the table below (for brevity, short_series_handling_configuration and short_series_handling are marked as handling_configuration and handling respectively).
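+----------+------------------------+--------------------+----------------------------------+
| handling | handling_configuration | resulting handling | resulting handling_configuration |
+==========+========================+====================+==================================+
| True     | auto                   | True               | auto                             |
+----------+------------------------+--------------------+----------------------------------+
| True     | pad                    | True               | auto                             |
+----------+------------------------+--------------------+----------------------------------+
| True     | drop                   | True               | auto                             |
+----------+------------------------+--------------------+----------------------------------+
| True     | None                   | False              | None                             |
+----------+------------------------+--------------------+----------------------------------+
| False    | auto                   | False              | None                             |
+----------+------------------------+--------------------+----------------------------------+
| False    | pad                    | False              | None                             |
+----------+------------------------+--------------------+----------------------------------+
| False    | drop                   | False              | None                             |
+----------+------------------------+--------------------+----------------------------------+
| False    | None                   | False              | None                             |
+----------+------------------------+--------------------+----------------------------------+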
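The synchronization table above can be read as a plain lookup. The sketch below transcribes its eight rows into a dictionary; the names SYNCHRONIZED and synchronize are hypothetical and are not part of the SDK.

```python
from typing import Dict, Optional, Tuple

Setting = Tuple[bool, Optional[str]]

# (handling, handling_configuration) -> (resulting handling, resulting handling_configuration)
SYNCHRONIZED: Dict[Setting, Setting] = {
    (True, "auto"): (True, "auto"),
    (True, "pad"): (True, "auto"),
    (True, "drop"): (True, "auto"),
    (True, None): (False, None),
    (False, "auto"): (False, None),
    (False, "pad"): (False, None),
    (False, "drop"): (False, None),
    (False, None): (False, None),
}


def synchronize(handling: bool, handling_configuration: Optional[str]) -> Setting:
    """Look up the resulting pair for the two short-series settings."""
    return SYNCHRONIZED[(handling, handling_configuration)]


print(synchronize(True, "pad"))
print(synchronize(False, "drop"))
```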
- freq
- str or None
Forecast frequency.
When forecasting, this parameter represents the period with which the forecast is desired, for example daily, weekly, yearly, etc. The forecast frequency is dataset frequency by default. You can optionally set it to greater (but not lesser) than dataset frequency. We'll aggregate the data and generate the results at forecast frequency. For example, for daily data, you can set the frequency to be daily, weekly or monthly, but not hourly. The frequency needs to be a pandas offset alias. Please refer to pandas documentation for more information: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects
- target_aggregation_function
- str or None
The function to be used to aggregate the time series target column to conform to a user-specified frequency. If target_aggregation_function is set but the freq parameter is not, an error is raised. The possible target aggregation functions are: "sum", "max", "min" and "mean".
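+----------------+-----------------------------+------------------------------------------------------------+
| freq           | target_aggregation_function | Data regularity fixing mechanism                           |
+================+=============================+============================================================+
| None (Default) | None (Default)              | The aggregation is not applied. If a valid frequency       |
|                |                             | cannot be determined, an error is raised.                  |
+----------------+-----------------------------+------------------------------------------------------------+
| Some Value     | None (Default)              | The aggregation is not applied. If the number of data      |
|                |                             | points compliant to the given frequency grid is less than  |
|                |                             | 90%, these points will be removed; otherwise an error is   |
|                |                             | raised.                                                    |
+----------------+-----------------------------+------------------------------------------------------------+
| None (Default) | Aggregation function        | An error about the missing frequency parameter is raised.  |
+----------------+-----------------------------+------------------------------------------------------------+
| Some Value     | Aggregation function        | Aggregate to frequency using the provided aggregation      |
|                |                             | function.                                                  |
+----------------+-----------------------------+------------------------------------------------------------+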
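The combinations of freq and target_aggregation_function described above can be summarized in a small decision function. This is a sketch with a hypothetical name and hypothetical return strings; it only mirrors which combinations aggregate and which raise an error.

```python
from typing import Optional

ALLOWED_AGGREGATIONS = {"sum", "max", "min", "mean"}


def describe_regularity_fix(
    freq: Optional[str] = None,
    target_aggregation_function: Optional[str] = None,
) -> str:
    """Summarize what happens for a given freq / aggregation function pair."""
    if target_aggregation_function is None:
        # With or without freq, no aggregation is applied.
        return "no aggregation"
    if target_aggregation_function not in ALLOWED_AGGREGATIONS:
        raise ValueError("unsupported aggregation function")
    if freq is None:
        raise ValueError("freq must be set when target_aggregation_function is set")
    return f"aggregate to {freq} using {target_aggregation_function}"


print(describe_regularity_fix())
print(describe_regularity_fix(freq="W", target_aggregation_function="sum"))
```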
- enable_voting_ensemble
- bool
Whether to enable/disable VotingEnsemble iteration. The default is True. For more information about ensembles, see Ensemble configuration.
- enable_stack_ensemble
- bool
Whether to enable/disable StackEnsemble iteration. The default is None. If enable_onnx_compatible_models flag is being set, then StackEnsemble iteration will be disabled. Similarly, for Timeseries tasks, StackEnsemble iteration will be disabled by default, to avoid risks of overfitting due to small training set used in fitting the meta learner. For more information about ensembles, see Ensemble configuration.
- debug_log
- str
The log file to write debug information to. If not specified, 'automl.log' is used.
- training_data
- DataFrame or Dataset or DatasetDefinition or TabularDataset
The training data to be used within the experiment. It should contain both training features and a label column (optionally a sample weights column).
If training_data is specified, then the label_column_name parameter must also be specified.
training_data was introduced in version 1.0.81.
- validation_data
- DataFrame or Dataset or DatasetDefinition or TabularDataset
The validation data to be used within the experiment. It should contain both training features and a label column (optionally a sample weights column).
If validation_data is specified, then the training_data and label_column_name parameters must be specified.
validation_data was introduced in version 1.0.81. For more information, see Configure data splits and cross-validation in automated machine learning.
- test_data
- Dataset or TabularDataset
The Model Test feature using test datasets or test data splits is a feature in Preview state and might change at any time.
The test data to be used for a test run that will automatically be started after model training is complete. The test run will get predictions using the best model and will compute metrics given these predictions.
If neither this parameter nor the test_size parameter is specified, then no test run will be executed automatically after model training is completed.
Test data should contain both features and a label column. If test_data is specified, then the label_column_name parameter must be specified.
- test_size
- float
The Model Test feature using test datasets or test data splits is a feature in Preview state and might change at any time.
What fraction of the training data to hold out for test data for a test run that will automatically be started after model training is complete. The test run will get predictions using the best model and will compute metrics given these predictions.
This should be between 0.0 and 1.0 non-inclusive.
If test_size is specified at the same time as validation_size, then the test data is split from training_data before the validation data is split. For example, if validation_size=0.1, test_size=0.1 and the original training data has 1000 rows, then the test data will have 100 rows, the validation data will contain 90 rows and the training data will have 810 rows.
For regression based tasks, random sampling is used. For classification tasks, stratified sampling is used. Forecasting does not currently support specifying a test dataset using a train/test split.
If neither this parameter nor the test_data parameter is specified, then no test run will be executed automatically after model training is completed.
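The row counts in the example above follow from splitting off the test data first and then taking the validation fraction of what remains. The sketch below reproduces that arithmetic; the function name is hypothetical and rounding to the nearest row is an assumption.

```python
from typing import Dict, Optional


def split_row_counts(
    n_rows: int,
    test_size: Optional[float] = None,
    validation_size: Optional[float] = None,
) -> Dict[str, int]:
    """Row counts when test data is split off before validation data."""
    for name, fraction in (("test_size", test_size), ("validation_size", validation_size)):
        if fraction is not None and not 0.0 < fraction < 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0 non-inclusive")
    test_rows = round(n_rows * test_size) if test_size is not None else 0
    remaining = n_rows - test_rows
    # The validation fraction applies to the rows left after the test split.
    validation_rows = round(remaining * validation_size) if validation_size is not None else 0
    return {
        "test": test_rows,
        "validation": validation_rows,
        "training": remaining - validation_rows,
    }


print(split_row_counts(1000, test_size=0.1, validation_size=0.1))
```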
- label_column_name
- str or int
The name of the label column. If the input data is from a pandas.DataFrame which doesn't have column names, column indices can be used instead, expressed as integers.
This parameter is applicable to the training_data, validation_data and test_data parameters.
label_column_name was introduced in version 1.0.81.
- weight_column_name
- str or int
The name of the sample weight column. Automated ML supports a weighted column as an input, causing rows in the data to be weighted up or down. If the input data is from a pandas.DataFrame which doesn't have column names, column indices can be used instead, expressed as integers.
This parameter is applicable to the training_data and validation_data parameters.
weight_column_name was introduced in version 1.0.81.
- cv_split_column_names
- list(str)
List of names of the columns that contain custom cross validation split. Each of the CV split columns represents one CV split where each row is marked either 1 for training or 0 for validation.
This parameter is applicable to the training_data parameter for custom cross validation purposes.
cv_split_column_names was introduced in version 1.6.0.
Use either cv_split_column_names or cv_splits_indices.
For more information, see Configure data splits and cross-validation in automated machine learning.
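As an illustration of the 1/0 marking, the sketch below generates one column per CV split for a small table. The column names cv1, cv2, ... and the helper name are hypothetical; in practice these columns would be added to training_data and their names passed as cv_split_column_names.

```python
from typing import Dict, List


def make_cv_split_columns(n_rows: int, n_splits: int) -> Dict[str, List[int]]:
    """Return {column name: markers}, where 1 marks training and 0 marks validation."""
    columns = {}
    for split in range(n_splits):
        # Every n_splits-th row, starting at `split`, is held out for validation.
        columns[f"cv{split + 1}"] = [
            0 if row % n_splits == split else 1 for row in range(n_rows)
        ]
    return columns


cv_columns = make_cv_split_columns(n_rows=6, n_splits=3)
cv_split_column_names = list(cv_columns)
print(cv_split_column_names)
print(cv_columns["cv1"])
```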
- enable_local_managed
- bool
Disabled parameter. Local managed runs can not be enabled at this time.
- enable_dnn
- bool
Whether to include DNN based models during model selection. The default in the init is None. However, the default is True for DNN NLP tasks, and it's False for all other AutoML tasks.
Remarks
The following code shows a basic example of creating an AutoMLConfig object and submitting an experiment for regression:
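import logging

from azureml.core import Experiment, Workspace
from azureml.train.automl import AutoMLConfig

# compute_target, train_data, and label are assumed to be defined earlier:
# the compute target, the training dataset, and the label column name.
automl_settings = {
    "n_cross_validations": 3,
    "primary_metric": "r2_score",
    "enable_early_stopping": True,
    "experiment_timeout_hours": 1.0,
    "max_concurrent_iterations": 4,
    "max_cores_per_iteration": -1,
    "verbosity": logging.INFO,
}

automl_config = AutoMLConfig(
    task="regression",
    compute_target=compute_target,
    training_data=train_data,
    label_column_name=label,
    **automl_settings,
)

# Submit the configuration as an experiment run in the workspace.
ws = Workspace.from_config()
experiment = Experiment(ws, "your-experiment-name")
run = experiment.submit(automl_config, show_output=True)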
A full sample is available at Regression
Examples of using AutoMLConfig for forecasting are in these notebooks:
Examples of using AutoMLConfig for all task types can be found in these automated ML notebooks.
For background on automated ML, see the articles:
Configure automated ML experiments in Python. In this article, there is information about the different algorithms and primary metrics used for each task type.
Auto-train a time-series forecast model. In this article, there is information about which constructor parameters and **kwargs are used in forecasting.
For more information about the different options for configuring training/validation data splits and cross-validation for your automated machine learning (AutoML) experiments, see Configure data splits and cross-validation in automated machine learning.
Methods
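+---------------------------------+-------------------------------------------------------------------------------+
| as_serializable_dict            | Convert the object into dictionary.                                           |
+---------------------------------+-------------------------------------------------------------------------------+
| get_supported_dataset_languages | Get supported languages and their corresponding language codes in ISO 639-3. |
+---------------------------------+-------------------------------------------------------------------------------+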
as_serializable_dict
Convert the object into dictionary.
as_serializable_dict() -> Dict[str, Any]
get_supported_dataset_languages
Get supported languages and their corresponding language codes in ISO 639-3.
get_supported_dataset_languages(use_gpu: bool) -> Dict[Any, Any]
Parameters
- use_gpu
boolean indicating whether gpu compute is being used or not.
Returns
dictionary of format {<language code>: <language name>}. Language code adheres to the ISO 639-3 standard; please refer to https://en.wikipedia.org/wiki/List_of_ISO_639-3_codes