Train a forecasting model with the AutoML Python API


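This example notebook shows how to train a time series forecasting model on Databricks with the AutoML Python API. Using a COVID-19 case count dataset, it calls automl.forecast() with a 30-day horizon at daily frequency to forecast future case counts, then loads the best model with MLflow to generate and plot the forecasts.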

Requirements

Databricks Runtime for Machine Learning 10.0 or above.
To save model predictions, Databricks Runtime for Machine Learning 10.5 or above.
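
As a rough illustration of the two thresholds above, the following sketch compares a runtime version string against 10.0 and 10.5. The helper names are hypothetical and not part of any Databricks API, and the version string format is an assumption made for the example.

```python
from typing import Tuple


def parse_runtime_version(version: str) -> Tuple[int, int]:
    """Turn a string such as '10.5' or '10.5.x-cpu-ml-scala2.12' into (10, 5)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def can_run_forecast(version: str) -> bool:
    # The requirements above ask for 10.0 or above to run the notebook.
    return parse_runtime_version(version) >= (10, 0)


def can_save_predictions(version: str) -> bool:
    # Saving predictions (output_database) needs 10.5 or above.
    return parse_runtime_version(version) >= (10, 5)


# Hypothetical version strings, for illustration only.
for runtime in ("10.4", "10.5.x-cpu-ml-scala2.12"):
    print(runtime, can_run_forecast(runtime), can_save_predictions(runtime))
```

A check like this could decide whether to pass output_database to the forecast call shown later in the notebook.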

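COVID-19 dataset

The dataset contains records of the number of COVID-19 cases in the United States by date, along with additional geographic information. The goal is to forecast how many cases of the virus will occur in the United States over the next 30 days.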

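import pyspark.pandas as ps

# Read the COVID-19 case data from the Databricks sample datasets.
df = ps.read_csv("/databricks-datasets/COVID/covid-19-data")
# Parse the date column; values that cannot be parsed become NaT.
df["date"] = ps.to_datetime(df["date"], errors="coerce")
# Cast the case counts to integers so they can serve as the forecast target.
df["cases"] = df["cases"].astype(int)
display(df)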

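AutoML training

The following command starts an AutoML run. You must provide the column that the model should predict in the target_col argument, and the time column in the time_col argument. When the run completes, you can follow the link to the best trial notebook to examine the training code.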

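This example also specifies:

horizon=30 to have AutoML forecast 30 days into the future.
frequency="d" to produce one forecast per day.
primary_metric="mdape" to set the metric to optimize for during training.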
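
To make the "mdape" metric more concrete, here is a minimal sketch of median absolute percentage error under its usual definition. This is not AutoML's implementation, which this page does not show; the function name is hypothetical, and skipping zero actual values is a choice made only for this sketch.

```python
from statistics import median


def mdape(actual, predicted):
    """Median absolute percentage error, as a fraction (0.1 means 10%)."""
    # Percentage error is undefined for a zero actual value, so skip those pairs.
    errors = [abs((a - p) / a) for a, p in zip(actual, predicted) if a != 0]
    if not errors:
        raise ValueError("MdAPE is undefined when every actual value is zero")
    return median(errors)


# Made-up values, for illustration only.
actual = [100, 200, 400, 800]
predicted = [110, 180, 400, 1000]
print(mdape(actual, predicted))
```

Because it takes the median rather than the mean, the metric is less sensitive to a few days with very large percentage errors.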

Note

automl.forecast() is only available on classic compute.

import databricks.automl
import logging

# Suppress informational py4j log messages that would otherwise clutter the output during fbprophet training
logging.getLogger("py4j").setLevel(logging.WARNING)

# Note: If you are running Databricks Runtime for Machine Learning 10.4 or below, use this line instead:
# summary = databricks.automl.forecast(df, target_col="cases", time_col="date", horizon=30, frequency="d",  primary_metric="mdape")

summary = databricks.automl.forecast(df, target_col="cases", time_col="date", horizon=30, frequency="d",  primary_metric="mdape", output_database="default")
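
The sketch below illustrates what horizon=30 with frequency="d" means in terms of calendar dates. It assumes the forecast period starts on the day after the last observation; the function name and the example date are hypothetical.

```python
from datetime import date, timedelta


def forecast_dates(last_observed: date, horizon: int) -> list:
    """Dates of the `horizon` daily periods that follow the last observation."""
    return [last_observed + timedelta(days=step) for step in range(1, horizon + 1)]


# Hypothetical last date in the training data, for illustration only.
future = forecast_dates(date(2021, 1, 31), horizon=30)
print(future[0], future[-1], len(future))
```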

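Iterate on the model

Explore the notebooks and experiments linked above.
If the metrics for the best trial notebook look good, continue with the next cell.
If you want to improve on the model generated by the best trial:
- Go to the notebook with the best trial and clone it.
- Edit the notebook as necessary to improve the model.
- When you are satisfied with the model, note the URI of the artifact where the trained model is logged. Assign this URI to the model_uri variable in the next cell.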

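Show the predicted results from the best model

Note: This section requires Databricks Runtime for Machine Learning 10.5 or above.

Load predictions from the best model

In Databricks Runtime for Machine Learning 10.5 or above, AutoML saves the predictions from the best model if output_database is provided.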

# Load the saved predictions.
forecast_pd = spark.table(summary.output_table_name)
display(forecast_pd)

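Use the model for forecasting

You can use the commands in this section with Databricks Runtime for Machine Learning 10.0 or above.

Load the model with MLflow

MLflow lets you easily import the model back into Python by using the AutoML trial_id.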

import mlflow.pyfunc

# The best trial's MLflow run ID identifies the run that logged the model.
trial_id = summary.best_trial.mlflow_run_id

# Build a runs:/ URI for the logged model and load it as a generic pyfunc model.
model_uri = "runs:/{run_id}/model".format(run_id=trial_id)
pyfunc_model = mlflow.pyfunc.load_model(model_uri)
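
If you cloned and edited a trial notebook, model_uri has to point at the run that logged your improved model. The helper below is a hypothetical sketch of assembling a runs:/ URI from a run ID; the run ID is a placeholder, and the "model" artifact path is taken from the cell above and may differ for a model you logged yourself.

```python
def build_model_uri(run_id: str, artifact_path: str = "model") -> str:
    """Return a 'runs:/<run_id>/<artifact_path>' URI for a model logged in an MLflow run."""
    if not run_id or "/" in run_id:
        raise ValueError("run_id must be a non-empty MLflow run ID without slashes")
    return f"runs:/{run_id}/{artifact_path.strip('/')}"


# Placeholder run ID, for illustration only. In the notebook it would come from
# summary.best_trial.mlflow_run_id or from the run of a cloned trial notebook.
model_uri = build_model_uri("0123456789abcdef0123456789abcdef")
print(model_uri)
```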

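Use the model to make forecasts

Call the model's predict_timeseries method to generate forecasts.
In Databricks Runtime for Machine Learning 10.5 or above, you can set include_history=False to get only the predicted data.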

forecasts = pyfunc_model._model_impl.python_model.predict_timeseries()
display(forecasts)

# Option for Databricks Runtime for Machine Learning 10.5 or above
# forecasts = pyfunc_model._model_impl.python_model.predict_timeseries(include_history=False)
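
On runtimes where include_history=False is not available, one way to keep only the future rows is to filter on the timestamp column after the call. The sketch below shows the idea on plain Python rows rather than on the actual forecast output; the ds and yhat column names come from the plotting code in this notebook, and the function name and values are made up.

```python
from datetime import date


def future_rows(forecast_rows, last_observed):
    """Keep only rows whose 'ds' timestamp falls after the last observed date."""
    return [row for row in forecast_rows if row["ds"] > last_observed]


# Toy rows standing in for the forecast output; the values are made up.
rows = [
    {"ds": date(2021, 1, 30), "yhat": 101.0},
    {"ds": date(2021, 1, 31), "yhat": 103.5},
    {"ds": date(2021, 2, 1), "yhat": 105.2},
]
print(future_rows(rows, last_observed=date(2021, 1, 31)))
```

If the forecast output is a pandas DataFrame, the same filter can be written as a boolean mask on the ds column.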

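Plot the forecasted points

In the plot below, the black points show the observed time series (where they are dense, they look like a thick black line), and the blue line is the forecast created by the model. The shaded band is the uncertainty interval.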

df_true = df.groupby("date").agg(y=("cases", "avg")).reset_index().to_pandas()

import matplotlib.pyplot as plt

fig = plt.figure(facecolor='w', figsize=(10, 6))
ax = fig.add_subplot(111)
forecasts = pyfunc_model._model_impl.python_model.predict_timeseries(include_history=True)
fcst_t = forecasts['ds'].dt.to_pydatetime()
ax.plot(df_true['date'].dt.to_pydatetime(), df_true['y'], 'k.', label='Observed data points')
ax.plot(fcst_t, forecasts['yhat'], ls='-', c='#0072B2', label='Forecasts')
ax.fill_between(fcst_t, forecasts['yhat_lower'], forecasts['yhat_upper'],
                color='#0072B2', alpha=0.2, label='Uncertainty interval')
ax.legend()
plt.show()
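
The groupby step at the top of the plotting code collapses the dataset to one value per date by averaging, since the dataset can hold several rows for the same date. The following is a minimal pure-Python sketch of that aggregation, with a hypothetical function name and made-up records.

```python
from collections import defaultdict
from statistics import mean


def average_by_date(records):
    """Group (date, cases) pairs by date and average the case counts."""
    grouped = defaultdict(list)
    for day, cases in records:
        grouped[day].append(cases)
    # Sort by date so the result is ordered like a time series.
    return {day: mean(values) for day, values in sorted(grouped.items())}


# Made-up records: two rows for the first date, one row for the second.
records = [("2020-03-01", 2), ("2020-03-01", 4), ("2020-03-02", 10)]
print(average_by_date(records))
```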

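Register and deploy the model

You can register and deploy your AutoML-trained model just like any other model in the MLflow model registry. See Log, load, and register MLflow models.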

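Troubleshooting: `No module named pandas.core.indexes.numeric`

When you serve an AutoML-trained model with Mosaic AI Model Serving, you may see the error No module named pandas.core.indexes.numeric. This happens when the pandas version used by AutoML differs from the version in the model serving endpoint environment. To resolve the issue:

Download the add-pandas-dependency.py script. The script edits the requirements.txt and conda.yaml of the logged model to pin pandas==1.5.3.
Modify the script to include the run_id of the MLflow run where the model was logged.
Re-register the model.
Deploy the new model version.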
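
To illustrate what pinning a dependency involves, the sketch below rewrites the text of a requirements.txt so that it contains an exact pandas==1.5.3 line. It is not the add-pandas-dependency.py script, whose contents this page does not show; the function name and the sample requirements are hypothetical.

```python
import re


def pin_requirement(requirements_text: str, package: str, version: str) -> str:
    """Replace or append a `package==version` line in requirements.txt content."""
    pinned = f"{package}=={version}"
    lines, found = [], False
    for line in requirements_text.splitlines():
        # The package name is everything before the first version specifier or marker.
        name = re.split(r"[=<>!~\[; ]", line.strip(), maxsplit=1)[0]
        if name.lower() == package.lower():
            lines.append(pinned)
            found = True
        else:
            lines.append(line)
    if not found:
        lines.append(pinned)
    return "\n".join(lines) + "\n"


# Made-up requirements content, for illustration only.
print(pin_requirement("mlflow==2.9.2\npandas==2.0.3\n", "pandas", "1.5.3"))
```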


Next steps

AutoML Python API reference.


Last updated on 2026-05-03
