# Mosaic AutoML Python API reference
This article describes the Mosaic AutoML Python API, which provides methods to start classification, regression, and forecasting AutoML runs. Each method call trains a set of models and generates a trial notebook for each model.
For more information on Mosaic AutoML, including a low-code UI option, see What is Mosaic AutoML?.
## Classify

The `databricks.automl.classify` method configures a Mosaic AutoML run to train a classification model. This method returns an AutoMLSummary.
> **Note:** The `max_trials` parameter is deprecated in Databricks Runtime 10.4 ML and is not supported in Databricks Runtime 11.0 ML and above. Use `timeout_minutes` to control the duration of an AutoML run.
```python
databricks.automl.classify(
    dataset: Union[pyspark.sql.DataFrame, pandas.DataFrame, pyspark.pandas.DataFrame, str],
    *,
    target_col: str,
    primary_metric: str = "f1",
    data_dir: Optional[str] = None,
    experiment_dir: Optional[str] = None,               # <DBR> 10.4 LTS ML and above
    experiment_name: Optional[str] = None,              # <DBR> 12.1 ML and above
    exclude_cols: Optional[List[str]] = None,           # <DBR> 10.3 ML and above
    exclude_frameworks: Optional[List[str]] = None,     # <DBR> 10.3 ML and above
    feature_store_lookups: Optional[List[Dict]] = None, # <DBR> 11.3 LTS ML and above
    imputers: Optional[Dict[str, Union[str, Dict[str, Any]]]] = None, # <DBR> 10.4 LTS ML and above
    pos_label: Optional[Union[int, bool, str]] = None,  # <DBR> 11.1 ML and above
    time_col: Optional[str] = None,
    split_col: Optional[str] = None,                    # <DBR> 15.3 ML and above
    sample_weight_col: Optional[str] = None,            # <DBR> 15.4 ML and above
    max_trials: Optional[int] = None,                   # <DBR> 10.5 ML and below
    timeout_minutes: Optional[int] = None,
) -> AutoMLSummary
```
### Classify parameters

| Parameter name | Type | Description |
| --- | --- | --- |
| `dataset` | `str`, `pandas.DataFrame`, `pyspark.pandas.DataFrame`, `pyspark.sql.DataFrame` | Input table name or DataFrame that contains training features and the target. The table name can be in the format `<database_name>.<table_name>` or `<schema_name>.<table_name>` for non-Unity Catalog tables. |
| `target_col` | `str` | Column name for the target label. |
| `primary_metric` | `str` | Metric used to evaluate and rank model performance. Supported metrics for classification: "f1" (default), "log_loss", "precision", "accuracy", "roc_auc". Supported metrics for regression: "r2" (default), "mae", "rmse", "mse". |
| `data_dir` | `str` of format `dbfs:/<folder-name>` | Optional. DBFS path used to store the training dataset. This path is visible to both driver and worker nodes. Databricks recommends leaving this field empty, so AutoML can save the training dataset as an MLflow artifact. If a custom path is specified, the dataset does not inherit the AutoML experiment's access permissions. |
| `experiment_dir` | `str` | Optional. Path to the directory in the workspace to save the generated notebooks and experiments. Default: `/Users/<username>/databricks_automl/` |
| `experiment_name` | `str` | Optional. Name for the MLflow experiment that AutoML creates. Default: Name is automatically generated. |
| `exclude_cols` | `List[str]` | Optional. List of columns to ignore during AutoML calculations. Default: `[]` |
| `exclude_frameworks` | `List[str]` | Optional. List of algorithm frameworks that AutoML should not consider as it develops models. Possible values: empty list, or one or more of "sklearn", "lightgbm", "xgboost". Default: `[]` (all frameworks are considered) |
| `feature_store_lookups` | `List[Dict]` | Optional. List of dictionaries that represent features from Feature Store for data augmentation (see the classification example after this table). Valid keys in each dictionary are: `table_name` (`str`): Required. Name of the feature table. `lookup_key` (`list` or `str`): Required. Column name(s) to use as the key when joining the feature table with the data passed in the `dataset` param. The order of the column names must match the order of the primary keys of the feature table. `timestamp_lookup_key` (`str`): Required if the specified table is a time series feature table. The column name to use when performing a point-in-time lookup on the feature table with the data passed in the `dataset` param. Default: `[]` |
| `imputers` | `Dict[str, Union[str, Dict[str, Any]]]` | Optional. Dictionary where each key is a column name, and each value is a string or dictionary describing the imputation strategy. If specified as a string, the value must be one of "mean", "median", or "most_frequent". To impute with a known value, specify the value as a dictionary, for example `{"strategy": "constant", "fill_value": <desired value>}`. You can also specify string options as dictionaries, for example `{"strategy": "mean"}`. If no imputation strategy is provided for a column, AutoML selects a default strategy based on column type and content. If you specify a non-default imputation method, AutoML does not perform semantic type detection. Default: `{}` |
| `pos_label` | `Union[int, bool, str]` | (Classification only) The positive class. This is useful for calculating metrics such as precision and recall. Should only be specified for binary classification problems. |
| `time_col` | `str` | Available in Databricks Runtime 10.1 ML and above. Optional. Column name for a time column. If provided, AutoML tries to split the dataset into training, validation, and test sets chronologically, using the earliest points as training data and the latest points as a test set. Accepted column types are timestamp and integer. With Databricks Runtime 10.2 ML and above, string columns are also supported. If the column type is string, AutoML tries to convert it to timestamp using semantic detection. If the conversion fails, the AutoML run fails. |
| `split_col` | `str` | Optional. Column name for a split column. Only available in Databricks Runtime 15.3 ML and above for API workflows. If provided, AutoML tries to split train/validate/test sets by user-specified values, and this column is automatically excluded from training features. Accepted column type is string. The value of each entry in this column must be one of "train", "validate", or "test". |
| `sample_weight_col` | `str` | Available in Databricks Runtime 15.4 ML and above for classification API workflows. Optional. Column name in the dataset that contains the sample weights for each row. Classification supports per-class sample weights. These weights adjust the importance of each class during model training. Each sample within a class must have the same sample weight, and weights must be non-negative decimal or integer values ranging from 0 to 10,000. Classes with higher sample weights are considered more important and have a greater influence on the learning algorithm. If this column is not specified, all classes are assumed to have equal weight. |
| `max_trials` | `int` | Optional. Maximum number of trials to run. This parameter is available in Databricks Runtime 10.5 ML and below, but is deprecated starting in Databricks Runtime 10.3 ML. In Databricks Runtime 11.0 ML and above, this parameter is not supported. Default: 20. If `timeout_minutes=None`, AutoML runs the maximum number of trials. |
| `timeout_minutes` | `int` | Optional. Maximum time to wait for AutoML trials to complete. Longer timeouts allow AutoML to run more trials and identify a model with better accuracy. Default: 120 minutes. Minimum value: 5 minutes. An error is reported if the timeout is too short to allow at least one trial to complete. |
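For instance, a minimal classification call might look like the following sketch. The table name, column names, and parameter values are hypothetical placeholders, not part of the API.

```python
import databricks.automl

# A minimal sketch of a classification run; "default.loans", "defaulted",
# and the parameter values below are hypothetical examples.
summary = databricks.automl.classify(
    dataset="default.loans",         # table with training features and the target
    target_col="defaulted",          # label column to predict
    primary_metric="roc_auc",        # rank trials by ROC AUC instead of the default "f1"
    exclude_frameworks=["xgboost"],  # consider only sklearn and lightgbm models
    pos_label=True,                  # positive class for binary-classification metrics
    timeout_minutes=30,              # stop the search after 30 minutes
)

print(summary.best_trial.metrics)    # validation metrics of the best trial
```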
## Regress

The `databricks.automl.regress` method configures an AutoML run to train a regression model. This method returns an AutoMLSummary.
> **Note:** The `max_trials` parameter is deprecated in Databricks Runtime 10.4 ML and is not supported in Databricks Runtime 11.0 ML and above. Use `timeout_minutes` to control the duration of an AutoML run.
```python
databricks.automl.regress(
    dataset: Union[pyspark.sql.DataFrame, pandas.DataFrame, pyspark.pandas.DataFrame, str],
    *,
    target_col: str,
    primary_metric: str = "r2",
    data_dir: Optional[str] = None,
    experiment_dir: Optional[str] = None,               # <DBR> 10.4 LTS ML and above
    experiment_name: Optional[str] = None,              # <DBR> 12.1 ML and above
    exclude_cols: Optional[List[str]] = None,           # <DBR> 10.3 ML and above
    exclude_frameworks: Optional[List[str]] = None,     # <DBR> 10.3 ML and above
    feature_store_lookups: Optional[List[Dict]] = None, # <DBR> 11.3 LTS ML and above
    imputers: Optional[Dict[str, Union[str, Dict[str, Any]]]] = None, # <DBR> 10.4 LTS ML and above
    time_col: Optional[str] = None,
    split_col: Optional[str] = None,                    # <DBR> 15.3 ML and above
    sample_weight_col: Optional[str] = None,            # <DBR> 15.3 ML and above
    max_trials: Optional[int] = None,                   # <DBR> 10.5 ML and below
    timeout_minutes: Optional[int] = None,
) -> AutoMLSummary
```
### Regress parameters

| Parameter name | Type | Description |
| --- | --- | --- |
| `dataset` | `str`, `pandas.DataFrame`, `pyspark.pandas.DataFrame`, `pyspark.sql.DataFrame` | Input table name or DataFrame that contains training features and the target. The table name can be in the format `<database_name>.<table_name>` or `<schema_name>.<table_name>` for non-Unity Catalog tables. |
| `target_col` | `str` | Column name for the target label. |
| `primary_metric` | `str` | Metric used to evaluate and rank model performance. Supported metrics for regression: "r2" (default), "mae", "rmse", "mse". Supported metrics for classification: "f1" (default), "log_loss", "precision", "accuracy", "roc_auc". |
| `data_dir` | `str` of format `dbfs:/<folder-name>` | Optional. DBFS path used to store the training dataset. This path is visible to both driver and worker nodes. Databricks recommends leaving this field empty, so AutoML can save the training dataset as an MLflow artifact. If a custom path is specified, the dataset does not inherit the AutoML experiment's access permissions. |
| `experiment_dir` | `str` | Optional. Path to the directory in the workspace to save the generated notebooks and experiments. Default: `/Users/<username>/databricks_automl/` |
| `experiment_name` | `str` | Optional. Name for the MLflow experiment that AutoML creates. Default: Name is automatically generated. |
| `exclude_cols` | `List[str]` | Optional. List of columns to ignore during AutoML calculations. Default: `[]` |
| `exclude_frameworks` | `List[str]` | Optional. List of algorithm frameworks that AutoML should not consider as it develops models. Possible values: empty list, or one or more of "sklearn", "lightgbm", "xgboost". Default: `[]` (all frameworks are considered) |
| `feature_store_lookups` | `List[Dict]` | Optional. List of dictionaries that represent features from Feature Store for data augmentation. Valid keys in each dictionary are: `table_name` (`str`): Required. Name of the feature table. `lookup_key` (`list` or `str`): Required. Column name(s) to use as the key when joining the feature table with the data passed in the `dataset` param. The order of the column names must match the order of the primary keys of the feature table. `timestamp_lookup_key` (`str`): Required if the specified table is a time series feature table. The column name to use when performing a point-in-time lookup on the feature table with the data passed in the `dataset` param. Default: `[]` |
| `imputers` | `Dict[str, Union[str, Dict[str, Any]]]` | Optional. Dictionary where each key is a column name, and each value is a string or dictionary describing the imputation strategy (see the regression example after this table). If specified as a string, the value must be one of "mean", "median", or "most_frequent". To impute with a known value, specify the value as a dictionary, for example `{"strategy": "constant", "fill_value": <desired value>}`. You can also specify string options as dictionaries, for example `{"strategy": "mean"}`. If no imputation strategy is provided for a column, AutoML selects a default strategy based on column type and content. If you specify a non-default imputation method, AutoML does not perform semantic type detection. Default: `{}` |
| `time_col` | `str` | Available in Databricks Runtime 10.1 ML and above. Optional. Column name for a time column. If provided, AutoML tries to split the dataset into training, validation, and test sets chronologically, using the earliest points as training data and the latest points as a test set. Accepted column types are timestamp and integer. With Databricks Runtime 10.2 ML and above, string columns are also supported. If the column type is string, AutoML tries to convert it to timestamp using semantic detection. If the conversion fails, the AutoML run fails. |
| `split_col` | `str` | Optional. Column name for a split column. Only available in Databricks Runtime 15.3 ML and above for API workflows. If provided, AutoML tries to split train/validate/test sets by user-specified values, and this column is automatically excluded from training features. Accepted column type is string. The value of each entry in this column must be one of "train", "validate", or "test". |
| `sample_weight_col` | `str` | Available in Databricks Runtime 15.3 ML and above for regression API workflows. Optional. Column name in the dataset that contains the sample weights for each row. These weights adjust the importance of each row during model training. Weights must be non-negative decimal or integer values ranging from 0 to 10,000. Rows with higher sample weights are considered more important and have a greater influence on the learning algorithm. If this column is not specified, all rows are assumed to have equal weight. |
| `max_trials` | `int` | Optional. Maximum number of trials to run. This parameter is available in Databricks Runtime 10.5 ML and below, but is deprecated starting in Databricks Runtime 10.3 ML. In Databricks Runtime 11.0 ML and above, this parameter is not supported. Default: 20. If `timeout_minutes=None`, AutoML runs the maximum number of trials. |
| `timeout_minutes` | `int` | Optional. Maximum time to wait for AutoML trials to complete. Longer timeouts allow AutoML to run more trials and identify a model with better accuracy. Default: 120 minutes. Minimum value: 5 minutes. An error is reported if the timeout is too short to allow at least one trial to complete. |
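As an illustration, a regression call with explicit imputation strategies might look like this sketch; the table and column names are hypothetical.

```python
import databricks.automl

# A minimal sketch of a regression run; "default.house_listings" and the
# column names below are hypothetical examples.
summary = databricks.automl.regress(
    dataset="default.house_listings",
    target_col="sale_price",
    primary_metric="rmse",   # rank trials by RMSE instead of the default "r2"
    imputers={
        "lot_size": "mean",                                           # impute with the column mean
        "zoning": {"strategy": "constant", "fill_value": "unknown"},  # impute with a constant
    },
    timeout_minutes=60,
)
```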
## Forecast

The `databricks.automl.forecast` method configures an AutoML run for training a forecasting model. This method returns an AutoMLSummary.
To use Auto-ARIMA, the time series must have a regular frequency (that is, the interval between any two points must be the same throughout the time series). The frequency must match the frequency unit specified in the API call. AutoML handles missing time steps by filling in those values with the previous value.
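The gap-filling behavior described above is conceptually similar to a pandas forward fill. The following standalone sketch, which uses made-up data and is not an AutoML call, illustrates the idea.

```python
import pandas as pd

# A daily series with 2024-01-03 missing.
ts = pd.DataFrame({
    "ds": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-04"]),
    "y": [10.0, 12.0, 15.0],
})

# Reindex to a regular daily frequency, then fill the gap with the previous
# value, mirroring how the text describes AutoML handling missing time steps.
regular = ts.set_index("ds").asfreq("D").ffill().reset_index()
print(regular)  # 2024-01-03 is inserted with y = 12.0
```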
```python
databricks.automl.forecast(
    dataset: Union[pyspark.sql.DataFrame, pandas.DataFrame, pyspark.pandas.DataFrame, str],
    *,
    target_col: str,
    time_col: str,
    primary_metric: str = "smape",
    country_code: str = "US",                # <DBR> 12.0 ML and above
    frequency: str = "D",
    horizon: int = 1,
    data_dir: Optional[str] = None,
    experiment_dir: Optional[str] = None,
    experiment_name: Optional[str] = None,   # <DBR> 12.1 ML and above
    exclude_frameworks: Optional[List[str]] = None,
    feature_store_lookups: Optional[List[Dict]] = None, # <DBR> 12.2 LTS ML and above
    identity_col: Optional[Union[str, List[str]]] = None,
    sample_weight_col: Optional[str] = None, # <DBR> 16.0 ML and above
    output_database: Optional[str] = None,   # <DBR> 10.5 ML and above
    timeout_minutes: Optional[int] = None,
) -> AutoMLSummary
```
### Forecasting parameters

| Parameter name | Type | Description |
| --- | --- | --- |
| `dataset` | `str`, `pandas.DataFrame`, `pyspark.pandas.DataFrame`, `pyspark.sql.DataFrame` | Input table name or DataFrame that contains training features and the target. The table name can be in the format `<database_name>.<table_name>` or `<schema_name>.<table_name>` for non-Unity Catalog tables. |
| `target_col` | `str` | Column name for the target label. |
| `time_col` | `str` | Name of the time column for forecasting. |
| `primary_metric` | `str` | Metric used to evaluate and rank model performance. Supported metrics: "smape" (default), "mse", "rmse", "mae", or "mdape". |
| `country_code` | `str` | Available in Databricks Runtime 12.0 ML and above. Only supported by the Prophet forecasting model. Optional. Two-letter country code that indicates which country's holidays the forecasting model should use. To ignore holidays, set this parameter to an empty string (""). Default: US (United States holidays). |
| `frequency` | `str` | Frequency of the time series for forecasting. This is the period with which events are expected to occur. The default setting is "D", or daily data. Be sure to change the setting if your data has a different frequency. Possible values: "W" (weeks); "D" / "days" / "day"; "hours" / "hour" / "hr" / "h"; "m" / "minute" / "min" / "minutes" / "T"; "S" / "seconds" / "sec" / "second". The following are only available with Databricks Runtime 12.0 ML and above: "M" / "month" / "months"; "Q" / "quarter" / "quarters"; "Y" / "year" / "years". Default: "D" |
| `horizon` | `int` | Number of periods into the future for which forecasts should be returned. The units are the time series frequency. Default: 1 |
| `data_dir` | `str` of format `dbfs:/<folder-name>` | Optional. DBFS path used to store the training dataset. This path is visible to both driver and worker nodes. Databricks recommends leaving this field empty, so AutoML can save the training dataset as an MLflow artifact. If a custom path is specified, the dataset does not inherit the AutoML experiment's access permissions. |
| `experiment_dir` | `str` | Optional. Path to the directory in the workspace to save the generated notebooks and experiments. Default: `/Users/<username>/databricks_automl/` |
| `experiment_name` | `str` | Optional. Name for the MLflow experiment that AutoML creates. Default: Name is automatically generated. |
| `exclude_frameworks` | `List[str]` | Optional. List of algorithm frameworks that AutoML should not consider as it develops models. Possible values: empty list, or one or more of "prophet", "arima". Default: `[]` (all frameworks are considered) |
| `feature_store_lookups` | `List[Dict]` | Optional. List of dictionaries that represent features from Feature Store for covariate data augmentation. Valid keys in each dictionary are: `table_name` (`str`): Required. Name of the feature table. `lookup_key` (`list` or `str`): Required. Column name(s) to use as the key when joining the feature table with the data passed in the `dataset` param. The order of the column names must match the order of the primary keys of the feature table. `timestamp_lookup_key` (`str`): Required if the specified table is a time series feature table. The column name to use when performing a point-in-time lookup on the feature table with the data passed in the `dataset` param. Default: `[]` |
| `identity_col` | `Union[str, list]` | Optional. Column(s) that identify the time series for multi-series forecasting. AutoML groups by these column(s) and the time column for forecasting. |
| `sample_weight_col` | `str` | Available in Databricks Runtime 16.0 ML and above. Only for multi-time-series workflows. Optional. Specifies the column in the dataset that contains sample weights. These weights indicate the relative importance of each time series during model training and evaluation. Time series with higher weights have a greater influence on the model. If not provided, all time series are treated with equal weight. All rows belonging to the same time series must have the same weight. Weights must be non-negative decimal or integer values between 0 and 10,000. |
| `output_database` | `str` | Optional. If provided, AutoML saves predictions of the best model to a new table in the specified database. Default: Predictions are not saved. |
| `timeout_minutes` | `int` | Optional. Maximum time to wait for AutoML trials to complete. Longer timeouts allow AutoML to run more trials and identify a model with better accuracy. Default: 120 minutes. Minimum value: 5 minutes. An error is reported if the timeout is too short to allow at least one trial to complete. |
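For example, a multi-series daily forecast might be configured as in the following sketch; the table name, column names, and output database are hypothetical.

```python
import databricks.automl

# A minimal sketch of a multi-series forecasting run; "default.store_demand"
# and the column names below are hypothetical examples.
summary = databricks.automl.forecast(
    dataset="default.store_demand",
    target_col="units_sold",
    time_col="date",
    identity_col=["store_id"],   # one time series per store
    frequency="D",               # daily data
    horizon=14,                  # forecast 14 periods (days) ahead
    output_database="default",   # save the best model's predictions to a table
    timeout_minutes=45,
)

print(summary.output_table_name)  # table that holds the saved predictions
```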
## Import notebook

The `databricks.automl.import_notebook` method imports a notebook that has been saved as an MLflow artifact. This method returns an ImportNotebookResult.
```python
databricks.automl.import_notebook(
    artifact_uri: str,
    path: str,
    overwrite: bool = False
) -> ImportNotebookResult
```
| Parameter name | Type | Description |
| --- | --- | --- |
| `artifact_uri` | `str` | The URI of the MLflow artifact that contains the trial notebook. |
| `path` | `str` | The path in the Databricks workspace where the notebook should be imported. This must be an absolute path. The directory is created if it does not exist. |
| `overwrite` | `bool` | Whether to overwrite the notebook if it already exists. Default: `False`. |
### Import notebook example

```python
summary = databricks.automl.classify(...)
result = databricks.automl.import_notebook(summary.trials[5].artifact_uri, "/Users/you@yourcompany.com/path/to/directory")
print(result.path)
print(result.url)
```
## AutoMLSummary

Summary object for an AutoML run that describes the metrics, parameters, and other details for each of the trials. You can also use this object to load the model trained by a specific trial.
| Property | Type | Description |
| --- | --- | --- |
| `experiment` | `mlflow.entities.Experiment` | The MLflow experiment used to log the trials. |
| `trials` | `List[TrialInfo]` | A list of `TrialInfo` objects containing information about all the trials that were run. |
| `best_trial` | `TrialInfo` | A `TrialInfo` object containing information about the trial that resulted in the best weighted score for the primary metric. |
| `metric_distribution` | `str` | The distribution of weighted scores for the primary metric across all trials. |
| `output_table_name` | `str` | Used with forecasting only, and only if `output_database` is provided. Name of the table in `output_database` containing the model's predictions. |
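For example, given a `summary` returned by one of the methods above, the properties can be inspected as in this sketch; the printed values depend on the run.

```python
# `summary` is the AutoMLSummary returned by classify, regress, or forecast.
print(summary.experiment.experiment_id)  # MLflow experiment that logged the trials
print(len(summary.trials))               # number of trials that were run
print(summary.best_trial.mlflow_run_id)  # MLflow run ID of the best trial
print(summary.metric_distribution)       # distribution of weighted primary-metric scores
```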
## TrialInfo

Summary object for each individual trial.
| Property | Type | Description |
| --- | --- | --- |
| `notebook_path` | `Optional[str]` | The path to the generated notebook for this trial in the workspace. For classification and regression, this value is set only for the best trial, while all other trials have the value set to `None`. For forecasting, this value is present for all trials. |
| `notebook_url` | `Optional[str]` | The URL of the generated notebook for this trial. For classification and regression, this value is set only for the best trial, while all other trials have the value set to `None`. For forecasting, this value is present for all trials. |
| `artifact_uri` | `Optional[str]` | The MLflow artifact URI for the generated notebook. |
| `mlflow_run_id` | `str` | The MLflow run ID associated with this trial run. |
| `metrics` | `Dict[str, float]` | The metrics logged in MLflow for this trial. |
| `params` | `Dict[str, str]` | The parameters logged in MLflow that were used for this trial. |
| `model_path` | `str` | The MLflow artifact URL of the model trained in this trial. |
| `model_description` | `str` | Short description of the model and the hyperparameters used for training this model. |
| `duration` | `str` | Training duration in minutes. |
| `preprocessors` | `str` | Description of the preprocessors run before training the model. |
| `evaluation_metric_score` | `float` | Score of the primary metric, evaluated for the validation dataset. |
`TrialInfo` has a method to load the model generated for the trial.

| Method | Description |
| --- | --- |
| `load_model()` | Load the model generated in this trial, logged as an MLflow artifact. |
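For example, the best trial's model can be loaded and used for inference, as in this sketch; `summary` is an AutoMLSummary from a previous run, and `new_data` is assumed to be a hypothetical pandas DataFrame with the same feature columns used in training.

```python
# Load the model from the best trial and score new data.
model = summary.best_trial.load_model()
predictions = model.predict(new_data)  # new_data: hypothetical feature DataFrame
```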
## ImportNotebookResult

| Property | Type | Description |
| --- | --- | --- |
| `path` | `str` | The path in the Databricks workspace where the notebook was imported. |
| `url` | `str` | The URL of the imported notebook. |