This article describes how to use MLflow, monitor GPU health, view logs, and manage model checkpoints on Serverless GPU Compute.
MLflow integration
Serverless GPU compute integrates natively with MLflow for experiment tracking, model logging, and metric visualization.
Setup recommendations:
Upgrade MLflow to version 3.7 or newer and follow the deep learning workflow patterns.
- Enable autologging for PyTorch Lightning:

  ```python
  import mlflow

  mlflow.pytorch.autolog()
  ```

- Customize your MLflow run name by wrapping your model training code in the `mlflow.start_run()` API scope. This gives you control over the run name and lets you restart from a previous run. Set the name with the `run_name` parameter in `mlflow.start_run(run_name="your-custom-name")` or in third-party libraries that support MLflow (for example, Hugging Face Transformers). Otherwise, the default run name is `jobTaskRun-xxxxx`.

  ```python
  from transformers import TrainingArguments

  args = TrainingArguments(
      report_to="mlflow",
      run_name="llama7b-sft-lr3e5",  # <-- MLflow run name
      logging_steps=50,
  )
  ```

- The serverless GPU API automatically launches an MLflow experiment with the default name `/Users/{WORKSPACE_USER}/{get_notebook_name()}`. You can override it with the `MLFLOW_EXPERIMENT_NAME` environment variable. Always use an absolute path:

  ```python
  import os

  os.environ["MLFLOW_EXPERIMENT_NAME"] = "/Users/<username>/my-experiment"
  ```

- Resume previous training by setting `MLFLOW_RUN_ID` to the ID of the earlier run:

  ```python
  mlflow.start_run(run_id="<previous-run-id>")
  ```

- Set the `step` parameter in `MLFlowLogger` to reasonable batch numbers. MLflow has a limit of 10 million metric steps; logging every single batch on large training runs can hit this limit. See Resource limits.
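To stay under that limit, log on a fixed interval rather than on every batch. A minimal sketch of interval-based logging (the `should_log` helper and `log_every` value are illustrative, not part of MLflow):

```python
def should_log(batch_idx, log_every=50):
    """Return True when this (1-indexed) batch should emit metrics."""
    return batch_idx % log_every == 0

# With log_every=50, a 1,000,000-batch run emits only 20,000 metric steps,
# far below MLflow's 10-million-step limit.
logged_steps = [b for b in range(1, 1001) if should_log(b)]
```

Inside a training loop, you would call `mlflow.log_metric(..., step=batch_idx)` only when `should_log(batch_idx)` is true.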
GPU health monitoring
Monitor GPU utilization and health directly through the Databricks notebook UI. GPU metrics (utilization, memory usage, temperature) are available in the compute panel when connected to Serverless GPU Compute.
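The same metrics can also be sampled from a notebook cell with `nvidia-smi`. A sketch of a parser for its CSV query output (the `--query-gpu` fields are standard `nvidia-smi` options; the `parse_gpu_metrics` helper is illustrative, not a Databricks API):

```python
def parse_gpu_metrics(csv_text):
    """Parse the output of:
    nvidia-smi --query-gpu=utilization.gpu,memory.used,memory.total,temperature.gpu \
               --format=csv,noheader,nounits
    into one dict per GPU."""
    rows = []
    for line in csv_text.strip().splitlines():
        util, mem_used, mem_total, temp = (f.strip() for f in line.split(","))
        rows.append(
            {
                "utilization_pct": int(util),
                "memory_used_mib": int(mem_used),
                "memory_total_mib": int(mem_total),
                "temperature_c": int(temp),
            }
        )
    return rows

# Example output for a hypothetical two-GPU node:
sample = "87, 61440, 81920, 64\n12, 1024, 81920, 41"
metrics = parse_gpu_metrics(sample)
```

On a live node you would feed the function real output, for example via `subprocess.run(["nvidia-smi", ...], capture_output=True, text=True).stdout`.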
Viewing logs
- Notebook output — Standard output and errors from your training code appear in the notebook cell output.
- Driver logs — Accessible via the compute panel for debugging startup issues, environment setup problems, and runtime errors.
- MLflow logs — Training metrics, parameters, and artifacts are viewable in the MLflow experiment UI.
Model checkpoints
Save model checkpoints to Unity Catalog volumes, which provide the same governance as other Unity Catalog objects. Use the following path format to reference files in volumes from a Databricks notebook:
```
/Volumes/<catalog>/<schema>/<volume>/<path>/<file-name>
```
Save checkpoints to volumes the same way you save them to local storage.
The example below shows how to write a PyTorch checkpoint to Unity Catalog volumes:
```python
import torch

checkpoint = {
    "epoch": epoch,                                   # last finished epoch
    "model_state_dict": model.state_dict(),           # weights & buffers
    "optimizer_state_dict": optimizer.state_dict(),   # optimizer state
    "loss": loss,                                     # optional current loss
    "metrics": {"val_acc": val_acc},                  # optional metrics
    # Add scheduler state, RNG state, and other metadata as needed.
}

checkpoint_path = "/Volumes/my_catalog/my_schema/model/checkpoints/ckpt-0001.pt"
torch.save(checkpoint, checkpoint_path)
```
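Restoring from such a checkpoint works the same way in reverse: rebuild the model and optimizer, then load their saved states. A self-contained sketch using a toy model (on Databricks you would point `checkpoint_path` at the same `/Volumes/...` location):

```python
import os
import tempfile

import torch

# Toy model/optimizer standing in for your real training objects.
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Save a checkpoint (on Databricks, use a /Volumes/... path instead).
checkpoint_path = os.path.join(tempfile.mkdtemp(), "ckpt-0001.pt")
torch.save(
    {
        "epoch": 3,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    },
    checkpoint_path,
)

# Resume: construct the same architecture, then restore the saved states.
resumed = torch.nn.Linear(4, 2)
resumed_opt = torch.optim.SGD(resumed.parameters(), lr=0.01)
state = torch.load(checkpoint_path)
resumed.load_state_dict(state["model_state_dict"])
resumed_opt.load_state_dict(state["optimizer_state_dict"])
start_epoch = state["epoch"] + 1  # continue from the next epoch
```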
This approach also works for distributed checkpoints. The example below shows distributed model checkpointing with the Torch Distributed Checkpoint API:
```python
import torch.distributed.checkpoint as dcp

def save_checkpoint(self, checkpoint_path):
    state_dict = self.get_state_dict(model, optimizer)
    dcp.save(state_dict, checkpoint_id=checkpoint_path)

trainer.save_checkpoint("/Volumes/my_catalog/my_schema/model/checkpoints")
```
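Checkpoint directories on volumes can grow quickly across epochs. A hypothetical retention helper (not a Databricks or PyTorch API) that keeps only the newest N checkpoint files, assuming zero-padded names like `ckpt-0001.pt` so lexical order matches age order:

```python
import os
import tempfile

def prune_checkpoints(checkpoint_dir, keep=3, suffix=".pt"):
    """Delete all but the `keep` newest checkpoint files; return the kept names."""
    ckpts = sorted(f for f in os.listdir(checkpoint_dir) if f.endswith(suffix))
    stale = ckpts[:-keep] if keep else ckpts
    for name in stale:
        os.remove(os.path.join(checkpoint_dir, name))
    return ckpts[-keep:] if keep else []

# Demo in a temporary directory (on Databricks, pass the /Volumes/... path).
d = tempfile.mkdtemp()
for i in range(5):
    open(os.path.join(d, f"ckpt-{i:04d}.pt"), "w").close()
kept = prune_checkpoints(d, keep=3)
```

Calling such a helper after each `torch.save` bounds storage use while preserving recent restore points.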
Multi-user collaboration
- To ensure all users can access shared code (for example, helper modules or environment YAML files), store it in `/Workspace/Shared` instead of user-specific folders like `/Workspace/Users/<your_email>/`.
- For code that is in active development, use Git folders in user-specific folders (`/Workspace/Users/<your_email>/`) and push to remote Git repos. This allows multiple users to have a user-specific clone and branch, while still using a remote Git repo for version control. See best practices for using Git on Databricks.
- Collaborators can share and comment on notebooks.
Global limits in Databricks
See Resource limits.