Use distributed training algorithms with Hyperopt
Note
The open-source version of Hyperopt is no longer being maintained.
Hyperopt will be removed in the next major DBR ML version. Azure Databricks recommends using Optuna for a similar experience and access to more up-to-date hyperparameter tuning algorithms.
In addition to single-machine training algorithms such as those from scikit-learn, you can use Hyperopt with distributed training algorithms. In this scenario, Hyperopt generates trials with different hyperparameter settings on the driver node. Each trial is executed from the driver node, giving it access to the full cluster resources. This setup works with any distributed machine learning algorithm or library, including Apache Spark MLlib and HorovodRunner.
When you use Hyperopt with distributed training algorithms, do not pass a trials argument to fmin(), and specifically, do not use the SparkTrials class. SparkTrials is designed to distribute trials for algorithms that are not themselves distributed. With distributed training algorithms, use the default Trials class, which runs on the cluster driver. Hyperopt evaluates each trial on the driver node so that the ML algorithm itself can initiate distributed training.
Note
Azure Databricks does not support automatic logging to MLflow with the Trials class. When using distributed training algorithms, you must manually call MLflow to log trials for Hyperopt.
The example notebook shows how to use Hyperopt to tune MLlib’s distributed training algorithms.
HorovodRunner is a general API used to run distributed deep learning workloads on Databricks. HorovodRunner integrates Horovod with Spark’s barrier mode to provide higher stability for long-running deep learning training jobs on Spark.
The example notebook shows how to use Hyperopt to tune distributed training for deep learning based on HorovodRunner.