AutoMLConfig Class

Represents configuration for submitting an automated ML experiment in Azure Machine Learning.

This configuration object contains and persists the parameters for configuring the experiment run, as well as the training data to be used at run time. For guidance on selecting your settings, see https://aka.ms/AutoMLConfig.

Create an AutoMLConfig.

Inheritance: builtins.object

AutoMLConfig

Constructor

AutoMLConfig(task: str, path: str | None = None, iterations: int | None = None, primary_metric: str | None = None, positive_label: Any | None = None, compute_target: Any | None = None, spark_context: Any | None = None, X: Any | None = None, y: Any | None = None, sample_weight: Any | None = None, X_valid: Any | None = None, y_valid: Any | None = None, sample_weight_valid: Any | None = None, cv_splits_indices: List[List[Any]] | None = None, validation_size: float | None = None, n_cross_validations: int | str | None = None, y_min: float | None = None, y_max: float | None = None, num_classes: int | None = None, featurization: str | FeaturizationConfig = 'auto', max_cores_per_iteration: int = 1, max_concurrent_iterations: int = 1, iteration_timeout_minutes: int | None = None, mem_in_mb: int | None = None, enforce_time_on_windows: bool = True, experiment_timeout_hours: float | None = None, experiment_exit_score: float | None = None, enable_early_stopping: bool = True, blocked_models: List[str] | None = None, blacklist_models: List[str] | None = None, exclude_nan_labels: bool = True, verbosity: int = 20, enable_tf: bool = False, model_explainability: bool = True, allowed_models: List[str] | None = None, whitelist_models: List[str] | None = None, enable_onnx_compatible_models: bool = False, enable_voting_ensemble: bool = True, enable_stack_ensemble: bool | None = None, debug_log: str = 'automl.log', training_data: Any | None = None, validation_data: Any | None = None, test_data: Any | None = None, test_size: float | None = None, label_column_name: str | None = None, weight_column_name: str | None = None, cv_split_column_names: List[str] | None = None, enable_local_managed: bool = False, enable_dnn: bool | None = None, forecasting_parameters: ForecastingParameters | None = None, **kwargs: Any)
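As a rough illustration of how the constructor is typically called, the sketch below collects a few parameters from the signature into a settings dictionary and passes them as keyword arguments. The column name `label`, the chosen values, and the helper `build_config` are hypothetical and not part of the SDK; the import assumes the Azure Machine Learning SDK package that provides `azureml.train.automl` is installed.

```python
# Illustrative settings only; every value here is an assumption, not a recommendation.
automl_settings = {
    "task": "classification",
    "primary_metric": "accuracy",
    "label_column_name": "label",      # hypothetical label column name
    "n_cross_validations": 5,
    "experiment_timeout_hours": 0.25,  # 0.25 hours = 15 minutes
    "enable_early_stopping": True,
    "featurization": "auto",
}


def build_config(training_data, settings=automl_settings):
    """Hypothetical helper: build an AutoMLConfig from a settings dictionary.

    The import is done lazily so the settings can be inspected
    without the Azure ML SDK installed.
    """
    from azureml.train.automl import AutoMLConfig

    return AutoMLConfig(training_data=training_data, **settings)
```

Keeping the settings in a dictionary makes it easy to reuse them across experiments and to swap the non-deprecated `training_data` and `label_column_name` inputs without touching the rest of the configuration.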

Parameters

Name	Description
task Required	str or Tasks The type of task to run. Values can be 'classification', 'regression', or 'forecasting' depending on the type of automated ML problem to solve.
path Required	str The full path to the Azure Machine Learning project folder. If not specified, the default is to use the current directory or ".".
iterations Required	int The total number of different algorithm and parameter combinations to test during an automated ML experiment. If not specified, the default is 1000 iterations.
primary_metric Required	str or Metric The metric that Automated Machine Learning will optimize for model selection. Automated Machine Learning collects more metrics than it can optimize. You can use get_primary_metrics to get a list of valid metrics for your given task. For more information on how metrics are calculated, see https://docs.microsoft.com/azure/machine-learning/how-to-configure-auto-train#primary-metric. If not specified, accuracy is used for classification tasks, normalized root mean squared error is used for forecasting and regression tasks, accuracy is used for image classification and image multi-label classification, and mean average precision is used for image object detection.
positive_label Required	Any The positive class label that Automated Machine Learning uses to calculate binary metrics. Binary metrics are calculated for classification tasks in two cases: (1) the label column consists of two classes, indicating a binary classification task; AutoML uses the specified positive class when positive_label is passed in, and otherwise picks a positive class based on the label-encoded value; (2) a multiclass classification task with positive_label specified. For more information on classification, check out metrics for classification scenarios.
compute_target Required	AbstractComputeTarget The Azure Machine Learning compute target to run the Automated Machine Learning experiment on. See https://docs.microsoft.com/en-us/azure/machine-learning/concept-automated-ml#local-remote for more information on compute targets.
spark_context Required	SparkContext The Spark context. Only applicable when used inside an Azure Databricks/Spark environment.
X Required	DataFrame or ndarray or Dataset or TabularDataset The training features to use when fitting pipelines during an experiment. This setting is being deprecated. Please use training_data and label_column_name instead.
y Required	DataFrame or ndarray or Dataset or TabularDataset The training labels to use when fitting pipelines during an experiment. This is the value your model will predict. This setting is being deprecated. Please use training_data and label_column_name instead.
sample_weight Required	DataFrame or ndarray or TabularDataset The weight to give to each training sample when running fitting pipelines, each row should correspond to a row in X and y data. Specify this parameter when specifying `X`. This setting is being deprecated. Please use training_data and weight_column_name instead.
X_valid Required	DataFrame or ndarray or Dataset or TabularDataset Validation features to use when fitting pipelines during an experiment. If specified, then `y_valid` or `sample_weight_valid` must also be specified. This setting is being deprecated. Please use validation_data and label_column_name instead.
y_valid Required	DataFrame or ndarray or Dataset or TabularDataset Validation labels to use when fitting pipelines during an experiment. Both `X_valid` and `y_valid` must be specified together. This setting is being deprecated. Please use validation_data and label_column_name instead.
sample_weight_valid Required	DataFrame or ndarray or TabularDataset The weight to give to each validation sample when running scoring pipelines, each row should correspond to a row in X and y data. Specify this parameter when specifying `X_valid`. This setting is being deprecated. Please use validation_data and weight_column_name instead.
cv_splits_indices Required	List[List[ndarray]] Indices where to split training data for cross validation. Each row is a separate cross fold, and within each cross fold, provide 2 numpy arrays: the first with the indices of the samples to use for training data and the second with the indices to use for validation data. That is, [[t1, v1], [t2, v2], ...] where t1 is the training indices for the first cross fold and v1 is the validation indices for the first cross fold. This option is supported when data is passed as a separate features dataset and label column. To specify existing data as validation data, use `validation_data`. To let AutoML extract validation data out of training data instead, specify either `n_cross_validations` or `validation_size`. Use `cv_split_column_names` if you have cross validation column(s) in `training_data`.
validation_size Required	float What fraction of the data to hold out for validation when user validation data is not specified. This should be between 0.0 and 1.0 non-inclusive. Specify `validation_data` to provide validation data, otherwise set `n_cross_validations` or `validation_size` to extract validation data out of the specified training data. For custom cross validation fold, use `cv_split_column_names`. For more information, see Configure data splits and cross-validation in automated machine learning.
n_cross_validations Required	int or str How many cross validations to perform when user validation data is not specified. Specify `validation_data` to provide validation data, otherwise set `n_cross_validations` or `validation_size` to extract validation data out of the specified training data. For custom cross validation fold, use `cv_split_column_names`. For more information, see Configure data splits and cross-validation in automated machine learning.
y_min Required	float Minimum value of y for a regression experiment. The combination of `y_min` and `y_max` is used to normalize test set metrics based on the input data range. This setting is being deprecated. Instead, this value will be computed from the data.
y_max Required	float Maximum value of y for a regression experiment. The combination of `y_min` and `y_max` is used to normalize test set metrics based on the input data range. This setting is being deprecated. Instead, this value will be computed from the data.
num_classes Required	int The number of classes in the label data for a classification experiment. This setting is being deprecated. Instead, this value will be computed from the data.
featurization Required	str or FeaturizationConfig 'auto' / 'off' / FeaturizationConfig Indicator for whether featurization step should be done automatically or not, or whether customized featurization should be used. Note: If the input data is sparse, featurization cannot be turned on. Column type is automatically detected. Based on the detected column type preprocessing/featurization is done as follows: Categorical: Target encoding, one hot encoding, drop high cardinality categories, impute missing values. Numeric: Impute missing values, cluster distance, weight of evidence. DateTime: Several features such as day, seconds, minutes, hours etc. Text: Bag of words, pre-trained Word embedding, text target encoding. More details can be found in the article Configure automated ML experiments in Python. To customize featurization step, provide a FeaturizationConfig object. Customized featurization currently supports blocking a set of transformers, updating column purpose, editing transformer parameters, and dropping columns. For more information, see Customize feature engineering. Note: Timeseries features are handled separately when the task type is set to forecasting independent of this parameter.
max_cores_per_iteration Required	int The maximum number of threads to use for a given training iteration. Acceptable values: Greater than 1 and less than or equal to the maximum number of cores on the compute target. Equal to -1, which means to use all the possible cores per iteration per child-run. Equal to 1, the default.
max_concurrent_iterations Required	int Represents the maximum number of iterations that would be executed in parallel. The default value is 1. AmlCompute: clusters support one iteration running per node. For multiple AutoML experiment parent runs executed in parallel on a single AmlCompute cluster, the sum of the `max_concurrent_iterations` values for all experiments should be less than or equal to the maximum number of nodes. Otherwise, runs will be queued until nodes are available. DSVM: supports multiple iterations per node. `max_concurrent_iterations` should be less than or equal to the number of cores on the DSVM. For multiple experiments run in parallel on a single DSVM, the sum of the `max_concurrent_iterations` values for all experiments should be less than or equal to the maximum number of nodes. Databricks: `max_concurrent_iterations` should be less than or equal to the number of worker nodes on Databricks. `max_concurrent_iterations` does not apply to local runs. Formerly, this parameter was named `concurrent_iterations`.
iteration_timeout_minutes Required	int Maximum time in minutes that each iteration can run for before it terminates. If not specified, a value of 1 month or 43200 minutes is used.
mem_in_mb Required	int Maximum memory usage allowed for each iteration before it terminates. If not specified, a value of 1 PB or 1073741824 MB is used.
enforce_time_on_windows Required	bool Whether to enforce a time limit on model training at each iteration on Windows. The default is True. If running from a Python script file (.py), see the documentation for allowing resource limits on Windows.
experiment_timeout_hours Required	float Maximum amount of time in hours that all iterations combined can take before the experiment terminates. Can be a decimal value like 0.25 representing 15 minutes. If not specified, the default experiment timeout is 6 days. To specify a timeout less than or equal to 1 hour, make sure your dataset's size is not greater than 10,000,000 (rows times columns), or an error results.
experiment_exit_score Required	float Target score for experiment. The experiment terminates after this score is reached. If not specified (no criteria), the experiment runs until no further progress is made on the primary metric. For more information on exit criteria, see this article.
enable_early_stopping Required	bool Whether to enable early termination if the score is not improving in the short term. The default is True. Early stopping logic: No early stopping for first 20 iterations (landmarks). Early stopping window starts on the 21st iteration and looks for early_stopping_n_iters iterations (currently set to 10). This means that the first iteration where stopping can occur is the 31st. AutoML still schedules 2 ensemble iterations AFTER early stopping, which might result in higher scores. Early stopping is triggered if the absolute value of best score calculated is the same for past early_stopping_n_iters iterations, that is, if there is no improvement in score for early_stopping_n_iters iterations.
blocked_models Required	list(str) or list(Classification) (for classification task) or list(Regression) (for regression task) or list(Forecasting) (for forecasting task) A list of algorithms to ignore for an experiment. If `enable_tf` is False, TensorFlow models are included in `blocked_models`.
blacklist_models Required	list(str) or list(Classification) (for classification task) or list(Regression) (for regression task) or list(Forecasting) (for forecasting task) Deprecated parameter, use blocked_models instead.
exclude_nan_labels Required	bool Whether to exclude rows with NaN values in the label. The default is True.
verbosity Required	int The verbosity level for writing to the log file. The default is INFO or 20. Acceptable values are defined in the Python logging library.
enable_tf Required	bool Deprecated parameter to enable/disable TensorFlow algorithms. The default is False.
model_explainability Required	bool Whether to enable explaining the best AutoML model at the end of all AutoML training iterations. The default is True. For more information, see Interpretability: model explanations in automated machine learning.
allowed_models Required	list(str) or list(Classification) (for classification task) or list(Regression) (for regression task) or list(Forecasting) (for forecasting task) A list of model names to search for an experiment. If not specified, then all models supported for the task are used minus any specified in `blocked_models` or deprecated TensorFlow models. The supported models for each task type are described in the SupportedModels class.
whitelist_models Required	list(str) or list(Classification) (for classification task) or list(Regression) (for regression task) or list(Forecasting) (for forecasting task) Deprecated parameter, use allowed_models instead.
enable_onnx_compatible_models Required	bool Whether to enable or disable enforcing the ONNX-compatible models. The default is False. For more information about Open Neural Network Exchange (ONNX) and Azure Machine Learning, see this article.
forecasting_parameters Required	ForecastingParameters A ForecastingParameters object to hold all the forecasting specific parameters.
time_column_name Required	str The name of the time column. This parameter is required when forecasting to specify the datetime column in the input data used for building the time series and inferring its frequency. This setting is being deprecated. Please use forecasting_parameters instead.
max_horizon Required	int The desired maximum forecast horizon in units of time-series frequency. The default value is 1. Units are based on the time interval of your training data (for example, monthly or weekly) that the forecaster should predict out. When task type is forecasting, this parameter is required. For more information on setting forecasting parameters, see Auto-train a time-series forecast model. This setting is being deprecated. Please use forecasting_parameters instead.
grain_column_names Required	str or list(str) The names of columns used to group a timeseries. It can be used to create multiple series. If grain is not defined, the data set is assumed to be one time-series. This parameter is used with task type forecasting. This setting is being deprecated. Please use forecasting_parameters instead.
target_lags Required	int or list(int) The number of past periods to lag from the target column. The default is 1. This setting is being deprecated. Please use forecasting_parameters instead. When forecasting, this parameter represents the number of rows to lag the target values based on the frequency of the data. This is represented as a list or single integer. Lag should be used when the relationship between the independent variables and the dependent variable does not match up or correlate by default. For example, when trying to forecast demand for a product, the demand in any month may depend on the price of specific commodities 3 months prior. In this example, you may want to lag the target (demand) negatively by 3 months so that the model is training on the correct relationship. For more information, see Auto-train a time-series forecast model.
feature_lags Required	str Flag for generating lags for the numeric features. This setting is being deprecated. Please use forecasting_parameters instead.
target_rolling_window_size Required	int The number of past periods used to create a rolling window average of the target column. This setting is being deprecated. Please use forecasting_parameters instead. When forecasting, this parameter represents n historical periods to use to generate forecasted values, <= training set size. If omitted, n is the full training set size. Specify this parameter when you only want to consider a certain amount of history when training the model.
country_or_region Required	str The country/region used to generate holiday features. This should be an ISO 3166 two-letter country/region code, for example 'US' or 'GB'. This setting is being deprecated. Please use forecasting_parameters instead.
use_stl Required	str Configure STL decomposition of the time-series target column. use_stl can take three values: None (default) - no STL decomposition, 'season' - only generate the season component, and 'season_trend' - generate both season and trend components. This setting is being deprecated. Please use forecasting_parameters instead.
seasonality Required	int or str Set time series seasonality. If seasonality is set to 'auto', it will be inferred. This setting is being deprecated. Please use forecasting_parameters instead.
short_series_handling_configuration Required	str The parameter defining how AutoML should handle short time series. Possible values: 'auto' (default), 'pad', 'drop' and None.
auto: short series will be padded if there are no long series, otherwise short series will be dropped.
pad: all the short series will be padded.
drop: all the short series will be dropped.
None: the short series will not be modified.
If set to 'pad', the table will be padded with zeroes and empty values for the regressors and random values for the target, with the mean equal to the target value median for the given time series id. If the median is greater than or equal to zero, the minimal padded value will be clipped at zero.
Input:
Date	numeric_value	string	target
2020-01-01	23	green	55
Output, assuming the minimal number of values is four:
Date	numeric_value	string	target
2019-12-29	0	NA	55.1
2019-12-30	0	NA	55.6
2019-12-31	0	NA	54.5
2020-01-01	23	green	55
Note: There are two parameters, short_series_handling_configuration and the legacy short_series_handling. When both parameters are set, they are synchronized as shown in the table below (for brevity, short_series_handling_configuration and short_series_handling are labeled handling_configuration and handling, respectively).
handling	handling_configuration	resulting handling	resulting handling_configuration
True	auto	True	auto
True	pad	True	auto
True	drop	True	auto
True	None	False	None
False	auto	False	None
False	pad	False	None
False	drop	False	None
False	None	False	None
freq Required	str or None Forecast frequency. When forecasting, this parameter represents the period with which the forecast is desired, for example daily, weekly, yearly, etc. The forecast frequency is dataset frequency by default. You can optionally set it to greater (but not lesser) than dataset frequency. We'll aggregate the data and generate the results at forecast frequency. For example, for daily data, you can set the frequency to be daily, weekly or monthly, but not hourly. The frequency needs to be a pandas offset alias. Please refer to pandas documentation for more information: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects
target_aggregation_function Required	str or None The function to be used to aggregate the time series target column to conform to a user-specified frequency. If target_aggregation_function is set but the freq parameter is not set, an error is raised. The possible target aggregation functions are: "sum", "max", "min" and "mean".
freq	target_aggregation_function	Data regularity fixing mechanism
None (Default)	None (Default)	The aggregation is not applied. If the valid frequency cannot be determined, the error will be raised.
Some Value	None (Default)	The aggregation is not applied. If the number of data points compliant to the given frequency grid is less than 90%, these points will be removed, otherwise the error will be raised.
None (Default)	Aggregation function	The error about the missing frequency parameter is raised.
Some Value	Aggregation function	Aggregate to frequency using the provided aggregation function.
enable_voting_ensemble Required	bool Whether to enable/disable VotingEnsemble iteration. The default is True. For more information about ensembles, see Ensemble configuration.
enable_stack_ensemble Required	bool Whether to enable/disable StackEnsemble iteration. The default is None. If enable_onnx_compatible_models flag is being set, then StackEnsemble iteration will be disabled. Similarly, for Timeseries tasks, StackEnsemble iteration will be disabled by default, to avoid risks of overfitting due to small training set used in fitting the meta learner. For more information about ensembles, see Ensemble configuration.
debug_log Required	str The log file to write debug information to. If not specified, 'automl.log' is used.
training_data Required	DataFrame or Dataset or DatasetDefinition or TabularDataset The training data to be used within the experiment. It should contain both training features and a label column (optionally a sample weights column). If `training_data` is specified, then the `label_column_name` parameter must also be specified. `training_data` was introduced in version 1.0.81.
validation_data Required	DataFrame or Dataset or DatasetDefinition or TabularDataset The validation data to be used within the experiment. It should contain both training features and label column (optionally a sample weights column). If `validation_data` is specified, then `training_data` and `label_column_name` parameters must be specified. `validation_data` was introduced in version 1.0.81. For more information, see Configure data splits and cross-validation in automated machine learning.
test_data Required	Dataset or TabularDataset The Model Test feature using test datasets or test data splits is a feature in Preview state and might change at any time. The test data to be used for a test run that will automatically be started after model training is complete. The test run will get predictions using the best model and will compute metrics given these predictions. If this parameter or the `test_size` parameter are not specified then no test run will be executed automatically after model training is completed. Test data should contain both features and label column. If `test_data` is specified then the `label_column_name` parameter must be specified.
test_size Required	float The Model Test feature using test datasets or test data splits is a feature in Preview state and might change at any time. What fraction of the training data to hold out for test data for a test run that will automatically be started after model training is complete. The test run will get predictions using the best model and will compute metrics given these predictions. This should be between 0.0 and 1.0 non-inclusive. If `test_size` is specified at the same time as `validation_size`, then the test data is split from `training_data` before the validation data is split. For example, if `validation_size=0.1`, `test_size=0.1` and the original training data has 1000 rows, then the test data will have 100 rows, the validation data will contain 90 rows and the training data will have 810 rows. For regression based tasks, random sampling is used. For classification tasks, stratified sampling is used. Forecasting does not currently support specifying a test dataset using a train/test split. If this parameter or the `test_data` parameter are not specified then no test run will be executed automatically after model training is completed.
label_column_name Required	Union[str, int] The name of the label column. If the input data is from a pandas.DataFrame which doesn't have column names, column indices can be used instead, expressed as integers. This parameter is applicable to `training_data`, `validation_data` and `test_data` parameters. `label_column_name` was introduced in version 1.0.81.
weight_column_name Required	Union[str, int] The name of the sample weight column. Automated ML supports a weighted column as an input, causing rows in the data to be weighted up or down. If the input data is from a pandas.DataFrame which doesn't have column names, column indices can be used instead, expressed as integers. This parameter is applicable to `training_data` and `validation_data` parameters. `weight_column_name` was introduced in version 1.0.81.
cv_split_column_names Required	list(str) List of names of the columns that contain custom cross validation splits. Each CV split column represents one CV split, where each row is marked either 1 for training or 0 for validation. This parameter is applicable to the `training_data` parameter for custom cross validation purposes. `cv_split_column_names` was introduced in version 1.6.0. Use either `cv_split_column_names` or `cv_splits_indices`. For more information, see Configure data splits and cross-validation in automated machine learning.
enable_local_managed Required	bool Disabled parameter. Local managed runs cannot be enabled at this time.
enable_dnn Required	bool Whether to include DNN based models during model selection. The default in the init is None. However, the default is True for DNN NLP tasks, and it's False for all other AutoML tasks.
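The `test_size` and `validation_size` arithmetic described above (test data is split off first, then validation data is taken from what remains) can be checked with a small sketch. The helper `split_sizes` is hypothetical, and rounding to the nearest row is an assumption; the exact rounding used by the service is not stated here. A fraction of 0.0 stands in for "not specified".

```python
def split_sizes(n_rows, test_size=0.0, validation_size=0.0):
    """Return (train_rows, validation_rows, test_rows) for the given fractions.

    The test split is taken from the full data first; the validation
    split is then taken from the rows that remain.
    """
    for name, fraction in (("test_size", test_size),
                           ("validation_size", validation_size)):
        if not 0.0 <= fraction < 1.0:
            raise ValueError(f"{name} must be at least 0.0 and less than 1.0")

    test_rows = round(n_rows * test_size)              # assumed rounding
    remaining = n_rows - test_rows
    validation_rows = round(remaining * validation_size)
    train_rows = remaining - validation_rows
    return train_rows, validation_rows, test_rows


# The example from the test_size description: 1000 rows, 10% test, 10% validation.
print(split_sizes(1000, test_size=0.1, validation_size=0.1))  # (810, 90, 100)
```

Note that the validation fraction applies to the 900 rows left after the test split, which is why the validation set has 90 rows rather than 100.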
max_cores_per_iteration Required	int The maximum number of threads to use for a given training iteration. Acceptable values: Greater than 1 and less than or equal to the maximum number of cores on the compute target. Equal to -1, which means to use all the possible cores per iteration per child-run. Equal to 1, the default value.
max_concurrent_iterations Required	int Represents the maximum number of iterations that would be executed in parallel. The default value is 1. AmlCompute clusters support one iteration running per node. For multiple experiments run in parallel on a single AmlCompute cluster, the sum of the `max_concurrent_iterations` values for all experiments should be less than or equal to the maximum number of nodes. DSVM supports multiple iterations per node. `max_concurrent_iterations` should be less than or equal to the number of cores on the DSVM. For multiple experiments run in parallel on a single DSVM, the sum of the `max_concurrent_iterations` values for all experiments should be less than or equal to the maximum number of nodes. Databricks - `max_concurrent_iterations` should be less than or equal to the number of worker nodes on Databricks. `max_concurrent_iterations` does not apply to local runs. Formerly, this parameter was named `concurrent_iterations`.
iteration_timeout_minutes Required	int Maximum time in minutes that each iteration can run for before it terminates. If not specified, a value of 1 month or 43200 minutes is used.
mem_in_mb Required	int Maximum amount of memory, in MB, that each iteration can use before it terminates. If not specified, a value of 1 PB or 1073741824 MB is used.
enforce_time_on_windows Required	bool Whether to enforce a time limit on model training at each iteration on Windows. The default is True. If running from a Python script file (.py), see the documentation for allowing resource limits on Windows.
experiment_timeout_hours Required	float Maximum amount of time in hours that all iterations combined can take before the experiment terminates. Can be a decimal value like 0.25 representing 15 minutes. If not specified, the default experiment timeout is 6 days. To specify a timeout less than or equal to 1 hour, make sure your dataset's size is not greater than 10,000,000 (rows times columns) or an error results.
experiment_exit_score Required	float Target score for experiment. The experiment terminates after this score is reached. If not specified (no criteria), the experiment runs until no further progress is made on the primary metric. For more information on exit criteria, see this article: /azure/machine-learning/how-to-configure-auto-train#exit-criteria.
enable_early_stopping Required	bool Whether to enable early termination if the score is not improving in the short term. The default is True. Early stopping logic: No early stopping for first 20 iterations (landmarks). Early stopping window starts on the 21st iteration and looks for early_stopping_n_iters iterations (currently set to 10). This means that the first iteration where stopping can occur is the 31st. AutoML still schedules 2 ensemble iterations AFTER early stopping, which might result in higher scores. Early stopping is triggered if the absolute value of best score calculated is the same for past early_stopping_n_iters iterations, that is, if there is no improvement in score for early_stopping_n_iters iterations.
blocked_models Required	list(str), or list(Classification) for a classification task, or list(Regression) for a regression task, or list(Forecasting) for a forecasting task A list of algorithms to ignore for an experiment. If `enable_tf` is False, TensorFlow models are included in `blocked_models`.
blacklist_models Required	list(str), or list(Classification) for a classification task, or list(Regression) for a regression task, or list(Forecasting) for a forecasting task Deprecated parameter, use blocked_models instead.
exclude_nan_labels Required	bool Whether to exclude rows with NaN values in the label. The default is True.
verbosity Required	int The verbosity level for writing to the log file. The default is INFO or 20. Acceptable values are defined in the Python logging library.
enable_tf Required	bool Whether to enable/disable TensorFlow algorithms. The default is False.
model_explainability Required	bool Whether to enable explaining the best AutoML model at the end of all AutoML training iterations. The default is True. For more information, see Interpretability: model explanations in automated machine learning.
allowed_models Required	list(str), or list(Classification) for a classification task, or list(Regression) for a regression task, or list(Forecasting) for a forecasting task A list of model names to search for an experiment. If not specified, then all models supported for the task are used minus any specified in `blocked_models` or deprecated TensorFlow models. The supported models for each task type are described in the SupportedModels class.
whitelist_models Required	Deprecated parameter, use allowed_models instead.
enable_onnx_compatible_models Required	bool Whether to enable or disable enforcing the ONNX-compatible models. The default is False. For more information about Open Neural Network Exchange (ONNX) and Azure Machine Learning, see this article.
forecasting_parameters Required	ForecastingParameters An object to hold all the forecasting specific parameters.
time_column_name Required	str The name of the time column. This parameter is required when forecasting to specify the datetime column in the input data used for building the time series and inferring its frequency. This setting is being deprecated. Please use forecasting_parameters instead.
max_horizon Required	int The desired maximum forecast horizon in units of time-series frequency. The default value is 1. This setting is being deprecated. Please use forecasting_parameters instead. Units are based on the time interval of your training data, e.g., monthly, weekly that the forecaster should predict out. When task type is forecasting, this parameter is required. For more information on setting forecasting parameters, see Auto-train a time-series forecast model.
grain_column_names Required	str or list(str) The names of columns used to group a timeseries. It can be used to create multiple series. If grain is not defined, the data set is assumed to be one time-series. This parameter is used with task type forecasting. This setting is being deprecated. Please use forecasting_parameters instead.
target_lags Required	int or list(int) The number of past periods to lag from the target column. The default is 1. This setting is being deprecated. Please use forecasting_parameters instead. When forecasting, this parameter represents the number of rows to lag the target values based on the frequency of the data. This is represented as a list or single integer. Lag should be used when the relationship between the independent variables and the dependent variable does not match up or correlate by default. For example, when trying to forecast demand for a product, the demand in any month may depend on the price of specific commodities 3 months prior. In this example, you may want to lag the target (demand) negatively by 3 months so that the model is training on the correct relationship. For more information, see Auto-train a time-series forecast model.
feature_lags Required	str Flag for generating lags for the numeric features. This setting is being deprecated. Please use forecasting_parameters instead.
target_rolling_window_size Required	int The number of past periods used to create a rolling window average of the target column. This setting is being deprecated. Please use forecasting_parameters instead. When forecasting, this parameter represents n historical periods to use to generate forecasted values, <= training set size. If omitted, n is the full training set size. Specify this parameter when you only want to consider a certain amount of history when training the model.
country_or_region Required	str The country/region used to generate holiday features. These should be ISO 3166 two-letter country/region codes, for example 'US' or 'GB'. This setting is being deprecated. Please use forecasting_parameters instead.
use_stl Required	str Configure STL Decomposition of the time-series target column. use_stl can take three values: None (default) - no stl decomposition, 'season' - only generate season component and season_trend - generate both season and trend components. This setting is being deprecated. Please use forecasting_parameters instead.
seasonality Required	int Set time series seasonality. If seasonality is set to -1, it will be inferred. If use_stl is not set, this parameter will not be used. This setting is being deprecated. Please use forecasting_parameters instead.
short_series_handling_configuration Required	str The parameter defining how AutoML should handle short time series. Possible values: 'auto' (default), 'pad', 'drop' and None.

auto: short series will be padded if there are no long series, otherwise short series will be dropped.
pad: all the short series will be padded.
drop: all the short series will be dropped.
None: the short series will not be modified.

If set to 'pad', the table will be padded with zeroes and empty values for the regressors and random values for the target, with the mean equal to the target value median for the given time series id. If the median is greater than or equal to zero, the minimal padded value will be clipped at zero.

Input:

Date	numeric_value	string	target
2020-01-01	23	green	55

Output, assuming the minimal number of values is four:

Date	numeric_value	string	target
2019-12-29	0	NA	55.1
2019-12-30	0	NA	55.6
2019-12-31	0	NA	54.5
2020-01-01	23	green	55

Note: There are two parameters, short_series_handling_configuration and the legacy short_series_handling. When both parameters are set, they are synchronized as shown in the table below (for brevity, short_series_handling_configuration and short_series_handling are labeled handling_configuration and handling, respectively).

handling	handling_configuration	resulting handling	resulting handling_configuration
True	auto	True	auto
True	pad	True	auto
True	drop	True	auto
True	None	False	None
False	auto	False	None
False	pad	False	None
False	drop	False	None
False	None	False	None
freq Required	str or None Forecast frequency. When forecasting, this parameter represents the period with which the forecast is desired, for example daily, weekly, yearly, etc. The forecast frequency is dataset frequency by default. You can optionally set it to greater (but not lesser) than dataset frequency. We'll aggregate the data and generate the results at forecast frequency. For example, for daily data, you can set the frequency to be daily, weekly or monthly, but not hourly. The frequency needs to be a pandas offset alias. Please refer to pandas documentation for more information: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#dateoffset-objects
target_aggregation_function Required	str or None The function to be used to aggregate the time series target column to conform to a user-specified frequency. If target_aggregation_function is set but the freq parameter is not set, an error is raised. The possible target aggregation functions are: "sum", "max", "min" and "mean".

freq	target_aggregation_function	Data regularity fixing mechanism
None (Default)	None (Default)	The aggregation is not applied. If a valid frequency cannot be determined, an error is raised.
Some Value	None (Default)	The aggregation is not applied. If the number of data points compliant with the given frequency grid is less than 90%, these points will be removed, otherwise an error is raised.
None (Default)	Aggregation function	An error about the missing frequency parameter is raised.
Some Value	Aggregation function	Aggregate to the frequency using the provided aggregation function.
enable_voting_ensemble Required	bool Whether to enable/disable VotingEnsemble iteration. The default is True. For more information about ensembles, see Ensemble configuration.
enable_stack_ensemble Required	bool Whether to enable/disable StackEnsemble iteration. The default is None. If enable_onnx_compatible_models flag is being set, then StackEnsemble iteration will be disabled. Similarly, for Timeseries tasks, StackEnsemble iteration will be disabled by default, to avoid risks of overfitting due to small training set used in fitting the meta learner. For more information about ensembles, see Ensemble configuration.
debug_log Required	str The log file to write debug information to. If not specified, 'automl.log' is used.
training_data Required	DataFrame or Dataset or DatasetDefinition or TabularDataset The training data to be used within the experiment. It should contain both training features and a label column (optionally a sample weights column). If `training_data` is specified, then the `label_column_name` parameter must also be specified. `training_data` was introduced in version 1.0.81.
validation_data Required	DataFrame or Dataset or DatasetDefinition or TabularDataset The validation data to be used within the experiment. It should contain both training features and label column (optionally a sample weights column). If `validation_data` is specified, then `training_data` and `label_column_name` parameters must be specified. `validation_data` was introduced in version 1.0.81. For more information, see Configure data splits and cross-validation in automated machine learning.
test_data Required	Dataset or TabularDataset The Model Test feature using test datasets or test data splits is a feature in Preview state and might change at any time. The test data to be used for a test run that will automatically be started after model training is complete. The test run will get predictions using the best model and will compute metrics given these predictions. If this parameter or the `test_size` parameter are not specified then no test run will be executed automatically after model training is completed. Test data should contain both features and label column. If `test_data` is specified then the `label_column_name` parameter must be specified.
test_size Required	float The Model Test feature using test datasets or test data splits is a feature in Preview state and might change at any time. What fraction of the training data to hold out for test data for a test run that will automatically be started after model training is complete. The test run will get predictions using the best model and will compute metrics given these predictions. This should be between 0.0 and 1.0 non-inclusive. If `test_size` is specified at the same time as `validation_size`, then the test data is split from `training_data` before the validation data is split. For example, if `validation_size=0.1`, `test_size=0.1` and the original training data has 1000 rows, then the test data will have 100 rows, the validation data will contain 90 rows and the training data will have 810 rows. For regression based tasks, random sampling is used. For classification tasks, stratified sampling is used. Forecasting does not currently support specifying a test dataset using a train/test split. If this parameter or the `test_data` parameter are not specified then no test run will be executed automatically after model training is completed.
label_column_name Required	Union[str, int] The name of the label column. If the input data is from a pandas.DataFrame which doesn't have column names, column indices can be used instead, expressed as integers. This parameter is applicable to `training_data`, `validation_data` and `test_data` parameters. `label_column_name` was introduced in version 1.0.81.
weight_column_name Required	Union[str, int] The name of the sample weight column. Automated ML supports a weighted column as an input, causing rows in the data to be weighted up or down. If the input data is from a pandas.DataFrame which doesn't have column names, column indices can be used instead, expressed as integers. This parameter is applicable to `training_data` and `validation_data` parameters. `weight_column_name` was introduced in version 1.0.81.
cv_split_column_names Required	list(str) List of names of the columns that contain custom cross validation split. Each of the CV split columns represents one CV split where each row is marked either 1 for training or 0 for validation. This parameter is applicable to `training_data` parameter for custom cross validation purposes. `cv_split_column_names` was introduced in version 1.6.0. Use either `cv_split_column_names` or `cv_splits_indices`. For more information, see Configure data splits and cross-validation in automated machine learning.
enable_local_managed Required	bool Disabled parameter. Local managed runs can not be enabled at this time.
enable_dnn Required	bool Whether to include DNN based models during model selection. The default in the init is None. However, the default is True for DNN NLP tasks, and it's False for all other AutoML tasks.
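
The early-stopping rule described under `enable_early_stopping` can be illustrated with a small, self-contained sketch. This is only one reading of that description, not the AutoML implementation: the function name `should_stop_early` and its arguments are hypothetical, it assumes a higher score is better, and it ignores the two ensemble iterations that AutoML schedules afterwards.

```python
from itertools import accumulate
from typing import Sequence


def should_stop_early(
    scores: Sequence[float],
    landmarks: int = 20,  # iterations during which stopping is never considered
    window: int = 10,     # early_stopping_n_iters, per the description above
) -> bool:
    """Decide whether stopping could trigger after the completed iterations.

    Hypothetical helper: `scores` holds one primary-metric value per
    completed iteration, in order, with higher values being better.
    """
    # With the defaults, 30 iterations must finish first, so the earliest
    # iteration at which stopping can occur is the 31st.
    if len(scores) < landmarks + window:
        return False

    # Best score seen so far after each iteration (non-decreasing).
    running_best = list(accumulate(scores, max))

    # Stop if the best score did not change across the last `window` iterations.
    recent = running_best[-window:]
    return recent[0] == recent[-1]


print(should_stop_early([0.80] * 30))  # True: no improvement in iterations 21-30
print(should_stop_early([0.80] * 29))  # False: the window is not complete yet
```

Because the running best score never decreases, comparing the first and last entries of the window is enough to tell whether any improvement happened inside it.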

Remarks

The following code shows a basic example of creating an AutoMLConfig object and submitting an experiment for regression:


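   import logging

   from azureml.core import Experiment, Workspace
   from azureml.train.automl import AutoMLConfig

   # compute_target, train_data, and label are assumed to be defined earlier:
   # a compute target, the training dataset, and the name of its label column.
   automl_settings = {
       "n_cross_validations": 3,
       "primary_metric": "r2_score",
       "enable_early_stopping": True,
       "experiment_timeout_hours": 1.0,
       "max_concurrent_iterations": 4,
       "max_cores_per_iteration": -1,
       "verbosity": logging.INFO,
   }

   automl_config = AutoMLConfig(
       task="regression",
       compute_target=compute_target,
       training_data=train_data,
       label_column_name=label,
       **automl_settings,
   )

   # Load the workspace from the local configuration and submit the experiment.
   ws = Workspace.from_config()
   experiment = Experiment(ws, "your-experiment-name")
   run = experiment.submit(automl_config, show_output=True)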

A full sample is available at Regression

Examples of using AutoMLConfig for forecasting are in these notebooks:

Examples of using AutoMLConfig for all task types can be found in these automated ML notebooks.

For background on automated ML, see the articles:

How to define a machine learning task
Configure automated ML experiments in Python. In this article, there is information about the different algorithms and primary metrics used for each task type.
Auto-train a time-series forecast model. In this article, there is information about which constructor parameters and **kwargs are used in forecasting.

For more information about the options for configuring training/validation data splits and cross-validation for your automated machine learning (AutoML) experiments, see Configure data splits and cross-validation in automated machine learning.
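
The split order described for `test_size` and `validation_size` (test data is split off first, then validation data is taken from what remains) can be checked with a short arithmetic sketch. The helper `split_counts` is hypothetical and only reproduces the row counts from the parameter description. Rounding to the nearest row is an assumption, and no sampling (random or stratified) is performed.

```python
from typing import Optional, Tuple


def split_counts(
    n_rows: int,
    test_size: Optional[float] = None,
    validation_size: Optional[float] = None,
) -> Tuple[int, int, int]:
    """Return (train, validation, test) row counts for the given fractions."""
    for name, size in (("test_size", test_size), ("validation_size", validation_size)):
        # Both fractions must lie strictly between 0.0 and 1.0.
        if size is not None and not 0.0 < size < 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0, non-inclusive")

    # The test rows are split from the training data first ...
    n_test = round(n_rows * test_size) if test_size is not None else 0
    remaining = n_rows - n_test

    # ... and the validation rows are then taken from what is left.
    n_valid = round(remaining * validation_size) if validation_size is not None else 0
    return remaining - n_valid, n_valid, n_test


# The example from the test_size description: 1000 rows, 10% test, 10% validation.
print(split_counts(1000, test_size=0.1, validation_size=0.1))  # (810, 90, 100)
```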

Methods

as_serializable_dict	Convert the object into dictionary.
get_supported_dataset_languages	Get supported languages and their corresponding language codes in ISO 639-3.

as_serializable_dict

Convert the object into dictionary.

as_serializable_dict() -> Dict[str, Any]

get_supported_dataset_languages

Get supported languages and their corresponding language codes in ISO 639-3.

get_supported_dataset_languages(use_gpu: bool) -> Dict[Any, Any]

Parameters

Name	Description
cls Required	Class object of AutoMLConfig.
use_gpu Required	boolean indicating whether gpu compute is being used or not.

Returns

Type	Description
	Dictionary of format {<language code>: <language name>}. Language codes adhere to the ISO 639-3 standard; see https://en.wikipedia.org/wiki/List_of_ISO_639-3_codes
