Mosaic AutoML Python API reference

This article describes the Mosaic AutoML Python API, which provides methods to start classification, regression, and forecasting AutoML runs. Each method call trains a set of models and generates a trial notebook for each model.

For more information on Mosaic AutoML, including a low-code UI option, see What is Mosaic AutoML?.

Classify

The databricks.automl.classify method configures a Mosaic AutoML run to train a classification model. This method returns an AutoMLSummary.

Note

The max_trials parameter is deprecated in Databricks Runtime 10.4 ML and is not supported in Databricks Runtime 11.0 ML and above. Use timeout_minutes to control the duration of an AutoML run.

databricks.automl.classify(
  dataset: Union[pyspark.sql.DataFrame, pandas.DataFrame, pyspark.pandas.DataFrame, str],
  *,
  target_col: str,
  primary_metric: str = "f1",
  data_dir: Optional[str] = None,
  experiment_dir: Optional[str] = None,                             # <DBR> 10.4 LTS ML and above
  experiment_name: Optional[str] = None,                            # <DBR> 12.1 ML and above
  exclude_cols: Optional[List[str]] = None,                         # <DBR> 10.3 ML and above
  exclude_frameworks: Optional[List[str]] = None,                   # <DBR> 10.3 ML and above
  feature_store_lookups: Optional[List[Dict]] = None,               # <DBR> 11.3 LTS ML and above
  imputers: Optional[Dict[str, Union[str, Dict[str, Any]]]] = None, # <DBR> 10.4 LTS ML and above
  pos_label: Optional[Union[int, bool, str]] = None,                 # <DBR> 11.1 ML and above
  time_col: Optional[str] = None,
  split_col: Optional[str] = None,                                  # <DBR> 15.3 ML and above
  sample_weight_col: Optional[str] = None,                          # <DBR> 15.4 ML and above
  max_trials: Optional[int] = None,                                 # <DBR> 10.5 ML and below
  timeout_minutes: Optional[int] = None,
) -> AutoMLSummary
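As a minimal sketch of a call to this method, the following wraps databricks.automl.classify with a small set of keyword arguments. The helper names build_classify_kwargs and run_classification, the excluded framework, and the 30-minute timeout are illustrative assumptions. The import only succeeds on a cluster running Databricks Runtime ML.

```python
from typing import Any, Dict


def build_classify_kwargs(target_col: str, timeout_minutes: int = 30) -> Dict[str, Any]:
    """Collect keyword arguments for databricks.automl.classify (illustrative helper)."""
    if timeout_minutes < 5:
        # The parameter table lists 5 minutes as the minimum timeout.
        raise ValueError("timeout_minutes must be at least 5")
    return {
        "target_col": target_col,
        "primary_metric": "f1",             # documented default for classification
        "exclude_frameworks": ["xgboost"],  # example only: skip one framework
        "timeout_minutes": timeout_minutes,
    }


def run_classification(dataset: Any, target_col: str):
    """Start an AutoML classification run; requires Databricks Runtime ML."""
    # Imported lazily so the helper above can be used outside Databricks.
    from databricks import automl

    # dataset is positional; every other argument is keyword-only.
    return automl.classify(dataset, **build_classify_kwargs(target_col))
```

On a Databricks Runtime ML cluster, run_classification(df, "income") would start the run for a hypothetical DataFrame df with an income column.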

Classify parameters

Parameter name Type Description
dataset str, pandas.DataFrame, pyspark.pandas.DataFrame, pyspark.sql.DataFrame Input table name or DataFrame that contains training features and target. Table name can be in format “<database_name>.<table_name>” or “<schema_name>.<table_name>” for non-Unity Catalog tables.
target_col str Column name for the target label.
primary_metric str Metric used to evaluate and rank model performance.

Supported metrics for regression: “r2” (default), “mae”, “rmse”, “mse”

Supported metrics for classification: “f1” (default), “log_loss”, “precision”, “accuracy”, “roc_auc”
data_dir str of format dbfs:/<folder-name> Optional. DBFS path used to store the training dataset. This path is visible to both driver and worker nodes.

Databricks recommends leaving this field empty, so AutoML can save the training dataset as an MLflow artifact.

If a custom path is specified, the dataset does not inherit the AutoML experiment’s access permissions.
experiment_dir str Optional. Path to the directory in the workspace to save the generated notebooks and experiments.

Default: /Users/<username>/databricks_automl/
experiment_name str Optional. Name for the MLflow experiment that AutoML creates.

Default: Name is automatically generated.
exclude_cols List[str] Optional. List of columns to ignore during AutoML calculations.

Default: []
exclude_frameworks List[str] Optional. List of algorithm frameworks that AutoML should not consider as it develops models. Possible values: empty list, or one or more of “sklearn”, “lightgbm”, “xgboost”.

Default: [] (all frameworks are considered)
feature_store_lookups List[Dict] Optional. List of dictionaries that represent features from Feature Store for data augmentation. Valid keys in each dictionary are:

- table_name (str): Required. Name of the feature table.
- lookup_key (list or str): Required. Column name(s) to use as key when joining the feature table with the data passed in the dataset param. The order of the column names must match the order of the primary keys of the feature table.
- timestamp_lookup_key (str): Required if the specified table is a time series feature table. The column name to use when performing point-in-time lookup on the feature table with the data passed in the dataset param.

Default: []
imputers Dict[str, Union[str, Dict[str, Any]]] Optional. Dictionary where each key is a column name, and each value is a string or dictionary describing the imputation strategy. If specified as a string, the value must be one of “mean”, “median”, or “most_frequent”. To impute with a known value, specify the value as a dictionary {"strategy": "constant", "fill_value": <desired value>}. You can also specify string options as dictionaries, for example {"strategy": "mean"}.

If no imputation strategy is provided for a column, AutoML selects a default strategy based on column type and content. If you specify a non-default imputation method, AutoML does not perform semantic type detection.

Default: {}
pos_label Union[int, bool, str] (Classification only) The positive class. This is useful for calculating metrics such as precision and recall. Should only be specified for binary classification problems.
time_col str Available in Databricks Runtime 10.1 ML and above.

Optional. Column name for a time column.

If provided, AutoML tries to split the dataset into training, validation, and test sets chronologically, using the earliest points as training data and the latest points as a test set.

Accepted column types are timestamp and integer. With Databricks Runtime 10.2 ML and above, string columns are also supported.

If column type is string, AutoML tries to convert it to timestamp using semantic detection. If the conversion fails, the AutoML run fails.
split_col str Optional. Column name for a split column. Only available in Databricks Runtime 15.3 ML and above for API workflows. If provided, AutoML tries to split train/validate/test sets by user-specified values, and this column is automatically excluded from training features.

Accepted column type is string. The value of each entry in this column must be one of the following: “train”, “validate”, or “test”.
sample_weight_col str Available in Databricks Runtime 15.4 ML and above for classification API workflows.

Optional. Column name in the dataset that contains the sample weights for each row. Classification supports per-class sample weights. These weights adjust the importance of each class during model training. Each sample within a class must have the same sample weight and weights must be non-negative decimal or integer values, ranging from 0 to 10,000. Classes with higher sample weights are considered more important, and have a greater influence on the learning algorithm. If this column is not specified, all classes are assumed to have equal weight.
max_trials int Optional. Maximum number of trials to run. This parameter is available in Databricks Runtime 10.5 ML and below, but is deprecated starting in Databricks Runtime 10.3 ML. In Databricks Runtime 11.0 ML and above, this parameter is not supported.

Default: 20

If timeout_minutes=None, AutoML runs the maximum number of trials.
timeout_minutes int Optional. Maximum time to wait for AutoML trials to complete. Longer timeouts allow AutoML to run more trials and identify a model with better accuracy.

Default: 120 minutes

Minimum value: 5 minutes

An error is reported if the timeout is too short to allow at least one trial to complete.
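The imputers and split_col rules in the table above can be checked before starting a run. The following is a sketch only: the helper names and column names are hypothetical, and AutoML performs its own validation, which may differ.

```python
from typing import Any, Dict, List, Union

STRING_STRATEGIES = {"mean", "median", "most_frequent"}
SPLIT_VALUES = {"train", "validate", "test"}

ImputerSpec = Union[str, Dict[str, Any]]


def imputer_errors(imputers: Dict[str, ImputerSpec]) -> List[str]:
    """Return a list of problems found in an imputers dictionary."""
    errors = []
    for column, spec in imputers.items():
        # A string is shorthand for {"strategy": <string>}.
        strategy = spec if isinstance(spec, str) else spec.get("strategy")
        if strategy == "constant":
            if isinstance(spec, str) or "fill_value" not in spec:
                errors.append(f"{column}: 'constant' needs a dict with 'fill_value'")
        elif strategy not in STRING_STRATEGIES:
            errors.append(f"{column}: unknown strategy {strategy!r}")
    return errors


def split_errors(split_values: List[str]) -> List[str]:
    """Return the distinct values that are not valid split_col entries."""
    return sorted(set(split_values) - SPLIT_VALUES)


# Hypothetical columns showing the three accepted forms.
imputers = {
    "age": "median",
    "income": {"strategy": "mean"},
    "region": {"strategy": "constant", "fill_value": "unknown"},
}
```

An empty list from both helpers means the values follow the documented forms, and the dictionary can then be passed as imputers=imputers.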

Regress

The databricks.automl.regress method configures an AutoML run to train a regression model. This method returns an AutoMLSummary.

Note

The max_trials parameter is deprecated in Databricks Runtime 10.4 ML and is not supported in Databricks Runtime 11.0 ML and above. Use timeout_minutes to control the duration of an AutoML run.

databricks.automl.regress(
  dataset: Union[pyspark.sql.DataFrame, pandas.DataFrame, pyspark.pandas.DataFrame, str],
  *,
  target_col: str,
  primary_metric: str = "r2",
  data_dir: Optional[str] = None,
  experiment_dir: Optional[str] = None,                             # <DBR> 10.4 LTS ML and above
  experiment_name: Optional[str] = None,                            # <DBR> 12.1 ML and above
  exclude_cols: Optional[List[str]] = None,                         # <DBR> 10.3 ML and above
  exclude_frameworks: Optional[List[str]] = None,                   # <DBR> 10.3 ML and above
  feature_store_lookups: Optional[List[Dict]] = None,               # <DBR> 11.3 LTS ML and above
  imputers: Optional[Dict[str, Union[str, Dict[str, Any]]]] = None, # <DBR> 10.4 LTS ML and above
  time_col: Optional[str] = None,
  split_col: Optional[str] = None,                                  # <DBR> 15.3 ML and above
  sample_weight_col: Optional[str] = None,                          # <DBR> 15.3 ML and above
  max_trials: Optional[int] = None,                                 # <DBR> 10.5 ML and below
  timeout_minutes: Optional[int] = None,
) -> AutoMLSummary

Regress parameters

Parameter name Type Description
dataset str, pandas.DataFrame, pyspark.pandas.DataFrame, pyspark.sql.DataFrame Input table name or DataFrame that contains training features and target. Table name can be in format “<database_name>.<table_name>” or “<schema_name>.<table_name>” for non-Unity Catalog tables.
target_col str Column name for the target label.
primary_metric str Metric used to evaluate and rank model performance.

Supported metrics for regression: “r2” (default), “mae”, “rmse”, “mse”

Supported metrics for classification: “f1” (default), “log_loss”, “precision”, “accuracy”, “roc_auc”
data_dir str of format dbfs:/<folder-name> Optional. DBFS path used to store the training dataset. This path is visible to both driver and worker nodes.

Databricks recommends leaving this field empty, so AutoML can save the training dataset as an MLflow artifact.

If a custom path is specified, the dataset does not inherit the AutoML experiment’s access permissions.
experiment_dir str Optional. Path to the directory in the workspace to save the generated notebooks and experiments.

Default: /Users/<username>/databricks_automl/
experiment_name str Optional. Name for the MLflow experiment that AutoML creates.

Default: Name is automatically generated.
exclude_cols List[str] Optional. List of columns to ignore during AutoML calculations.

Default: []
exclude_frameworks List[str] Optional. List of algorithm frameworks that AutoML should not consider as it develops models. Possible values: empty list, or one or more of “sklearn”, “lightgbm”, “xgboost”.

Default: [] (all frameworks are considered)
feature_store_lookups List[Dict] Optional. List of dictionaries that represent features from Feature Store for data augmentation. Valid keys in each dictionary are:

- table_name (str): Required. Name of the feature table.
- lookup_key (list or str): Required. Column name(s) to use as key when joining the feature table with the data passed in the dataset param. The order of the column names must match the order of the primary keys of the feature table.
- timestamp_lookup_key (str): Required if the specified table is a time series feature table. The column name to use when performing point-in-time lookup on the feature table with the data passed in the dataset param.

Default: []
imputers Dict[str, Union[str, Dict[str, Any]]] Optional. Dictionary where each key is a column name, and each value is a string or dictionary describing the imputation strategy. If specified as a string, the value must be one of “mean”, “median”, or “most_frequent”. To impute with a known value, specify the value as a dictionary {"strategy": "constant", "fill_value": <desired value>}. You can also specify string options as dictionaries, for example {"strategy": "mean"}.

If no imputation strategy is provided for a column, AutoML selects a default strategy based on column type and content. If you specify a non-default imputation method, AutoML does not perform semantic type detection.

Default: {}
time_col str Available in Databricks Runtime 10.1 ML and above.

Optional. Column name for a time column.

If provided, AutoML tries to split the dataset into training, validation, and test sets chronologically, using the earliest points as training data and the latest points as a test set.

Accepted column types are timestamp and integer. With Databricks Runtime 10.2 ML and above, string columns are also supported.

If column type is string, AutoML tries to convert it to timestamp using semantic detection. If the conversion fails, the AutoML run fails.
split_col str Optional. Column name for a split column. Only available in Databricks Runtime 15.3 ML and above for API workflows. If provided, AutoML tries to split train/validate/test sets by user-specified values, and this column is automatically excluded from training features.

Accepted column type is string. The value of each entry in this column must be one of the following: “train”, “validate”, or “test”.
sample_weight_col str Available in Databricks Runtime 15.3 ML and above for regression API workflows.

Optional. Column name in the dataset that contains the sample weights for each row. These weights adjust the importance of each row during model training. Weights must be non-negative decimal or integer values, ranging from 0 to 10,000. Rows with higher sample weights are considered more important, and have a greater influence on the learning algorithm. If this column is not specified, all rows are assumed to have equal weight.
max_trials int Optional. Maximum number of trials to run. This parameter is available in Databricks Runtime 10.5 ML and below, but is deprecated starting in Databricks Runtime 10.3 ML. In Databricks Runtime 11.0 ML and above, this parameter is not supported.

Default: 20

If timeout_minutes=None, AutoML runs the maximum number of trials.
timeout_minutes int Optional. Maximum time to wait for AutoML trials to complete. Longer timeouts allow AutoML to run more trials and identify a model with better accuracy.

Default: 120 minutes

Minimum value: 5 minutes

An error is reported if the timeout is too short to allow at least one trial to complete.
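The sample weight constraints above (non-negative values from 0 to 10,000) can be checked on the weight column before calling regress. This sketch uses plain Python sequences and hypothetical helper names. run_regression assumes a Databricks Runtime ML cluster and a dataset with columns named price and row_weight, both invented for the example.

```python
from typing import Any, List, Sequence

MAX_WEIGHT = 10_000


def invalid_weight_rows(weights: Sequence[float]) -> List[int]:
    """Return the indexes of weights outside the documented 0 to 10,000 range."""
    return [i for i, w in enumerate(weights) if not 0 <= w <= MAX_WEIGHT]


def run_regression(dataset: Any):
    """Start a regression run that weights rows by the 'row_weight' column."""
    from databricks import automl  # available on Databricks Runtime ML only

    return automl.regress(
        dataset,
        target_col="price",              # hypothetical target column
        primary_metric="rmse",
        sample_weight_col="row_weight",  # hypothetical weight column
        timeout_minutes=30,
    )
```

For example, invalid_weight_rows([1, -0.1, 20_000, 3]) flags the second and third rows, which should be corrected before the run starts.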

Forecast

The databricks.automl.forecast method configures an AutoML run for training a forecasting model. This method returns an AutoMLSummary. To use Auto-ARIMA, the time series must have a regular frequency (that is, the interval between any two points must be the same throughout the time series). The frequency must match the frequency unit specified in the API call. AutoML handles missing time steps by filling in those values with the previous value.

databricks.automl.forecast(
  dataset: Union[pyspark.sql.DataFrame, pandas.DataFrame, pyspark.pandas.DataFrame, str],
  *,
  target_col: str,
  time_col: str,
  primary_metric: str = "smape",
  country_code: str = "US",                                         # <DBR> 12.0 ML and above
  frequency: str = "D",
  horizon: int = 1,
  data_dir: Optional[str] = None,
  experiment_dir: Optional[str] = None,
  experiment_name: Optional[str] = None,                            # <DBR> 12.1 ML and above
  exclude_frameworks: Optional[List[str]] = None,
  feature_store_lookups: Optional[List[Dict]] = None,               # <DBR> 12.2 LTS ML and above
  identity_col: Optional[Union[str, List[str]]] = None,
  sample_weight_col: Optional[str] = None,                          # <DBR> 16.0 ML and above
  output_database: Optional[str] = None,                            # <DBR> 10.5 ML and above
  timeout_minutes: Optional[int] = None,
) -> AutoMLSummary
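The regular-frequency requirement and the previous-value fill described before the signature can be illustrated in plain Python. This is a conceptual sketch, not AutoML's internal implementation; the function names and sample data are made up for the example.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def is_regular(timestamps: List[datetime]) -> bool:
    """True if every pair of consecutive timestamps is the same interval apart."""
    ordered = sorted(timestamps)
    gaps = {later - earlier for earlier, later in zip(ordered, ordered[1:])}
    return len(gaps) <= 1


def fill_missing_steps(series: Dict[datetime, float], step: timedelta) -> Dict[datetime, float]:
    """Insert missing time steps, copying the previous observed value forward.

    Assumes a non-empty series whose timestamps all lie on the step grid.
    """
    ordered = sorted(series)
    filled: Dict[datetime, float] = {}
    current, last_value = ordered[0], series[ordered[0]]
    while current <= ordered[-1]:
        last_value = series.get(current, last_value)
        filled[current] = last_value
        current += step
    return filled


daily = {
    datetime(2024, 1, 1): 10.0,
    datetime(2024, 1, 2): 12.0,
    datetime(2024, 1, 4): 15.0,  # January 3 is missing
}
filled = fill_missing_steps(daily, timedelta(days=1))
```

Here the gap on January 3 makes the original series irregular, and the filled series carries the January 2 value forward into that slot.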

Forecasting parameters

Parameter name Type Description
dataset str, pandas.DataFrame, pyspark.pandas.DataFrame, pyspark.sql.DataFrame Input table name or DataFrame that contains training features and target.

Table name can be in format “<catalog_name>.<schema_name>.<table_name>”, or “<schema_name>.<table_name>” for non-Unity Catalog tables.
target_col str Column name for the target label.
time_col str Name of the time column for forecasting.
primary_metric str Metric used to evaluate and rank model performance.

Supported metrics: “smape” (default), “mse”, “rmse”, “mae”, or “mdape”.
country_code str Available in Databricks Runtime 12.0 ML and above. Only supported by the Prophet forecasting model.

Optional. Two-letter country code that indicates which country’s holidays the forecasting model should use. To ignore holidays, set this parameter to an empty string (“”).

Supported countries.

Default: US (United States holidays).
frequency str Frequency of the time series for forecasting. This is the period with which events are expected to occur. The default setting is “D” or daily data. Be sure to change the setting if your data has a different frequency.

Possible values:

“W” (weeks)

“D” / “days” / “day”

“hours” / “hour” / “hr” / “h”

“m” / “minute” / “min” / “minutes” / “T”

“S” / “seconds” / “sec” / “second”

The following are only available with Databricks Runtime 12.0 ML and above:

“M” / “month” / “months”

“Q” / “quarter” / “quarters”

“Y” / “year” / “years”

Default: “D”
horizon int Number of periods into the future for which forecasts should be returned.

The units are the time series frequency.

Default: 1
data_dir str of format dbfs:/<folder-name> Optional. DBFS path used to store the training dataset. This path is visible to both driver and worker nodes.

Databricks recommends leaving this field empty, so AutoML can save the training dataset as an MLflow artifact.

If a custom path is specified, the dataset does not inherit the AutoML experiment’s access permissions.
experiment_dir str Optional. Path to the directory in the workspace to save the generated notebooks and experiments.

Default: /Users/<username>/databricks_automl/
experiment_name str Optional. Name for the MLflow experiment that AutoML creates.

Default: Name is automatically generated.
exclude_frameworks List[str] Optional. List of algorithm frameworks that AutoML should not consider as it develops models. Possible values: empty list, or one or more of “prophet”, “arima”.

Default: [] (all frameworks are considered)
feature_store_lookups List[Dict] Optional. List of dictionaries that represent features from Feature Store for covariate data augmentation. Valid keys in each dictionary are:

- table_name (str): Required. Name of the feature table.
- lookup_key (list or str): Required. Column name(s) to use as key when joining the feature table with the data passed in the dataset param. The order of the column names must match the order of the primary keys of the feature table.
- timestamp_lookup_key (str): Required if the specified table is a time series feature table. The column name to use when performing point-in-time lookup on the feature table with the data passed in the dataset param.

Default: []
identity_col Union[str, list] Optional. Column(s) that identify the time series for multi-series forecasting. AutoML groups by these column(s) and the time column for forecasting.
sample_weight_col str Available in Databricks Runtime 16.0 ML and above. Only for multi-time-series workflows.

Optional. Specifies the column in the dataset that contains sample weights. These weights indicate the relative importance of each time series during model training and evaluation.

Time series with higher weights have a greater influence on the model. If not provided, all time series are treated with equal weight.

All rows belonging to the same time series must have the same weight.

Weights must be non-negative values, either decimals or integers, and be between 0 and 10,000.
output_database str Optional. If provided, AutoML saves predictions of the best model to a new table in the specified database.

Default: Predictions are not saved.
timeout_minutes int Optional. Maximum time to wait for AutoML trials to complete. Longer timeouts allow AutoML to run more trials and identify a model with better accuracy.

Default: 120 minutes

Minimum value: 5 minutes

An error is reported if the timeout is too short to allow at least one trial to complete.
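A call that combines the parameters above might look like the following sketch. The alias table is copied from the frequency values listed above, while the helper names, column names, and settings are hypothetical. The databricks.automl import requires Databricks Runtime ML.

```python
from typing import Any, Dict, Set

# Aliases from the frequency parameter description, grouped by unit.
FREQUENCY_ALIASES: Dict[str, Set[str]] = {
    "weeks": {"W"},
    "days": {"D", "days", "day"},
    "hours": {"hours", "hour", "hr", "h"},
    "minutes": {"m", "minute", "min", "minutes", "T"},
    "seconds": {"S", "seconds", "sec", "second"},
    # Listed as Databricks Runtime 12.0 ML and above only:
    "months": {"M", "month", "months"},
    "quarters": {"Q", "quarter", "quarters"},
    "years": {"Y", "year", "years"},
}


def frequency_unit(frequency: str) -> str:
    """Map a frequency alias to its unit; note that "m" and "M" differ."""
    for unit, aliases in FREQUENCY_ALIASES.items():
        if frequency in aliases:
            return unit
    raise ValueError(f"unsupported frequency: {frequency!r}")


def run_forecast(dataset: Any):
    """Start a weekly multi-series forecast; requires Databricks Runtime ML."""
    from databricks import automl

    return automl.forecast(
        dataset,
        target_col="sales",       # hypothetical target column
        time_col="date",          # hypothetical time column
        frequency="W",
        horizon=8,                # 8 periods ahead, so 8 weeks at frequency "W"
        identity_col="store_id",  # hypothetical series identifier
        country_code="",          # empty string ignores holidays
        timeout_minutes=30,
    )
```

Checking the alias first, for example frequency_unit("m") versus frequency_unit("M"), helps avoid confusing minutes with months when setting frequency and horizon.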

Import notebook

The databricks.automl.import_notebook method imports a notebook that has been saved as an MLflow artifact. This method returns an ImportNotebookResult.

databricks.automl.import_notebook(
  artifact_uri: str,
  path: str,
  overwrite: bool = False
) -> ImportNotebookResult

Parameter name Type Description
artifact_uri str The URI of the MLflow artifact that contains the trial notebook.
path str The path in the Databricks workspace where the notebook should be imported. This must be an absolute path. The directory will be created if it does not exist.
overwrite bool Whether to overwrite the notebook if it already exists. It is False by default.

Import notebook example

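import databricks.automl

summary = databricks.automl.classify(...)
# Import the notebook of the sixth trial (index 5) into a workspace directory.
result = databricks.automl.import_notebook(summary.trials[5].artifact_uri, "/Users/you@yourcompany.com/path/to/directory")
print(result.path)
print(result.url)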

AutoMLSummary

Summary object for an AutoML run that describes the metrics, parameters, and other details for each of the trials. You can also use this object to load the model trained by a specific trial.

Property Type Description
experiment mlflow.entities.Experiment The MLflow experiment used to log the trials.
trials List[TrialInfo] A list of TrialInfo objects containing information about all the trials that were run.
best_trial TrialInfo A TrialInfo object containing information about the trial that resulted in the best weighted score for the primary metric.
metric_distribution str The distribution of weighted scores for the primary metric across all trials.
output_table_name str Used with forecasting only and only if output_database is provided.

Name of the table in output_database containing the model’s predictions.
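To show how the summary properties fit together, the sketch below ranks trials by evaluation_metric_score. It uses stand-in objects built with SimpleNamespace in place of a real AutoMLSummary, so the run IDs and scores are invented. Whether a higher score is better depends on the primary metric, so that choice is passed explicitly.

```python
from types import SimpleNamespace
from typing import Any, List


def rank_trials(trials: List[Any], higher_is_better: bool = True) -> List[Any]:
    """Sort TrialInfo-like objects by evaluation_metric_score, best first."""
    return sorted(
        trials,
        key=lambda trial: trial.evaluation_metric_score,
        reverse=higher_is_better,
    )


# Stand-ins for summary.trials; a real run returns TrialInfo objects.
fake_summary = SimpleNamespace(
    trials=[
        SimpleNamespace(mlflow_run_id="run-a", evaluation_metric_score=0.81),
        SimpleNamespace(mlflow_run_id="run-b", evaluation_metric_score=0.88),
        SimpleNamespace(mlflow_run_id="run-c", evaluation_metric_score=0.79),
    ]
)

ranked = rank_trials(fake_summary.trials)
for trial in ranked:
    print(trial.mlflow_run_id, trial.evaluation_metric_score)
```

With a real run, pass summary.trials instead of the stand-ins. For an error metric such as log_loss, where lower is better, set higher_is_better=False.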

TrialInfo

Summary object for each individual trial.

Property Type Description
notebook_path Optional[str] The path to the generated notebook for this trial in the workspace.

For classification and regression, this value is set only for the best trial, while all other trials have the value set to None.

For forecasting, this value is present for all trials.
notebook_url Optional[str] The URL of the generated notebook for this trial.

For classification and regression, this value is set only for the best trial, while all other trials have the value set to None.

For forecasting, this value is present for all trials.
artifact_uri Optional[str] The MLflow artifact URI for the generated notebook.
mlflow_run_id str The MLflow run ID associated with this trial run.
metrics Dict[str, float] The metrics logged in MLflow for this trial.
params Dict[str, str] The parameters logged in MLflow that were used for this trial.
model_path str The MLflow artifact URL of the model trained in this trial.
model_description str Short description of the model and the hyperparameters used for training this model.
duration str Training duration in minutes.
preprocessors str Description of the preprocessors run before training the model.
evaluation_metric_score float Score of primary metric, evaluated for the validation dataset.

TrialInfo has a method to load the model generated for the trial.

Method Description
load_model() Load the model generated in this trial, logged as an MLflow artifact.

ImportNotebookResult

Property Type Description
path str The path in the Databricks workspace where the notebook was imported.
url str The URL of the imported notebook.