Use XGBoost on Azure Databricks
This article provides examples of training machine learning models using XGBoost in Azure Databricks. Databricks Runtime for Machine Learning includes XGBoost libraries for both Python and Scala. You can train XGBoost models on an individual machine or in a distributed fashion.
Train XGBoost models on a single node
You can train models using the Python xgboost package. This package supports only single node workloads. To train a PySpark ML pipeline and take advantage of distributed training, see Distributed training of XGBoost models.
XGBoost Python notebook
Distributed training of XGBoost models
For distributed training of XGBoost models, Databricks includes PySpark estimators based on the xgboost package. Databricks also includes the Scala package xgboost-4j. For details and example notebooks, see the following:
- Distributed training of XGBoost models using xgboost.spark (Databricks Runtime 12.0 ML and above)
- Distributed training of XGBoost models using sparkdl.xgboost (deprecated starting with Databricks Runtime 12.0 ML)
- Distributed training of XGBoost models using Scala
Install XGBoost on Azure Databricks
If you need to install XGBoost on Databricks Runtime or use a different version than the one pre-installed with Databricks Runtime ML, follow these instructions.
Install XGBoost on Databricks Runtime ML
XGBoost is included in Databricks Runtime ML. You can use these libraries in Databricks Runtime ML without installing any packages.
For the version of XGBoost installed in the Databricks Runtime ML version you are using, see the release notes. To install a different version of the Python package in Databricks Runtime ML, install XGBoost as a Databricks PyPI library. Specify it as the following and replace <xgboost version> with the desired version.
xgboost==<xgboost version>
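To confirm which version is active on a cluster before or after installing a library, a notebook cell along the following lines may help. This is a minimal sketch that uses only the standard library module importlib.metadata; the helper name installed_version is hypothetical and not part of Databricks or XGBoost.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "xgboost") -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("xgboost is not installed in this environment")
else:
    print(f"xgboost {version} is installed")
```

Comparing the printed version against the release notes is a quick way to check whether the cluster is using the pre-installed package or a library you installed yourself.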
Install XGBoost on Databricks Runtime
- Python package: Execute the following command in a notebook cell:
%pip install xgboost
To install a specific version, replace <xgboost version> with the desired version:
%pip install xgboost==<xgboost version>
- Scala/Java packages: Install as a Databricks library with the Spark Package name xgboost-linux64.