View, manage, and analyze Mosaic AI Model Training runs

Important

This feature is in Public Preview in the following regions: centralus, eastus, eastus2, northcentralus, westcentralus, westus, and westus3. Reach out to your Databricks account team to enroll in the Public Preview.

This article describes how to view, manage, and analyze Mosaic AI Model Training (formerly Foundation Model Training) runs using either the APIs or the UI.

For information on creating runs, see Create a training run using the Mosaic AI Model Training API and Create a training run using the Mosaic AI Model Training UI.

Use Mosaic AI Model Training APIs to view and manage training runs

The Mosaic AI Model Training APIs provide the following functions for managing your training runs.

Get a run

Use the get() function to return a run you have launched. You can identify the run by its name or pass a run object.

from databricks.model_training import foundation_model as fm

fm.get('<your-run-name>')

List runs

Use the list() function to see the runs you have launched. The following table lists the optional filters you can specify.

| Optional filter | Definition |
| --- | --- |
| finetuning_runs | A list of runs to get. Defaults to selecting all runs. |
| user_emails | If shared runs is enabled for your workspace, you can filter results by the user who submitted the training run. Defaults to no user filter. |
| before | A datetime or datetime string to filter runs before. Defaults to all runs. |
| after | A datetime or datetime string to filter runs after. Defaults to all runs. |

from databricks.model_training import foundation_model as fm

fm.list()

# filtering example
fm.list(before='2023-01-01', limit=50)
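
Because before and after accept datetime values, a relative time window can be computed instead of typed by hand. The following is a minimal sketch, not part of the Mosaic AI Model Training API: the helper name recent_runs_filter is hypothetical, and the use of UTC timestamps is an assumption about what list() accepts.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, Optional


def recent_runs_filter(days: int, now: Optional[datetime] = None) -> Dict[str, datetime]:
    """Build keyword arguments that select runs from the last `days` days."""
    if days <= 0:
        raise ValueError("days must be positive")
    # Assumption: UTC datetimes are acceptable to list(); adjust this if your
    # workspace expects a different convention.
    end = now if now is not None else datetime.now(timezone.utc)
    return {"after": end - timedelta(days=days), "before": end}


# Intended use (requires the Databricks SDK, so it is left commented out):
# from databricks.model_training import foundation_model as fm
# runs = fm.list(**recent_runs_filter(7))
```

Passing now explicitly keeps the window reproducible, which is convenient when the same filter is reused across several calls.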

Cancel training runs

To cancel training runs, use the cancel() function and pass either a single run or a list of runs.

from databricks.model_training import foundation_model as fm

run_to_cancel = '<name-of-run-to-cancel>'
fm.cancel(run_to_cancel)
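
Since cancel() also accepts a list, list() and cancel() can be combined to cancel a group of related runs in one call. The sketch below is illustrative only: cancel_runs_with_prefix is a hypothetical helper, it assumes each run object returned by list() exposes a name attribute, and it takes the fm module as a parameter so that it can be tried with a stand-in object. Note that list() may also return runs that have already finished; how cancel() treats those is not covered here.

```python
from typing import Any, List


def cancel_runs_with_prefix(fm: Any, prefix: str) -> List[str]:
    """Cancel every listed run whose name starts with `prefix`.

    `fm` is the imported foundation_model module, or a stand-in that offers
    the same list() and cancel() functions. Returns the names of the runs
    that were passed to cancel().
    """
    # Assumption: each run object returned by list() has a `name` attribute.
    matching = [run for run in fm.list() if run.name.startswith(prefix)]
    if matching:
        # cancel() accepts a single run or a list of runs.
        fm.cancel(matching)
    return [run.name for run in matching]


# Intended use (requires the Databricks SDK, so it is left commented out):
# from databricks.model_training import foundation_model as fm
# cancel_runs_with_prefix(fm, "<run-name-prefix>")
```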

Delete training runs

Use delete() to delete training runs by passing a single run or a list of runs.

from databricks.model_training import foundation_model as fm

fm.delete('<name-of-run-to-delete>')

Review status of training runs

The following table lists the events created by a training run. Use the get_events() function at any time during a run to see its progress.

| Event type | Example event message | Definition |
| --- | --- | --- |
| CREATED | Run created. | Training run was created. If resources are available, the run starts. Otherwise, it enters the Pending state. |
| STARTED | Run started. | Resources have been allocated, and the run has started. |
| DATA_VALIDATED | Training data validated. | Validated that training data is correctly formatted. |
| MODEL_INITIALIZED | Model data downloaded and initialized for base model meta-llama/Llama-2-7b-chat-hf. | Weights for the base model have been downloaded, and training is ready to begin. |
| TRAIN_UPDATED | [epoch=1/1][batch=50/56][ETA=5min] Train loss: 1.71 | Reports the current training batch, epoch, or token; the estimated time for training to finish (not including checkpoint upload time); and the train loss. This event is updated when each batch ends. If the run configuration specifies max_duration in tok units, progress is reported in tokens. |
| TRAIN_FINISHED | Training completed. | Training has finished. Checkpoint uploading begins. |
| COMPLETED | Run completed. Final weights uploaded. | Checkpoint has been uploaded, and the run has been completed. |
| CANCELED | Run canceled. | The run is canceled if fm.cancel() is called on it. |
| FAILED | One or more train dataset samples has unknown keys. Please check the documentation for supported data formats. | The run failed. Check event_message for actionable details, or contact support. |

from databricks.model_training import foundation_model as fm

fm.get_events('<your-run-name>')
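
The TRAIN_UPDATED message in the table above follows a regular pattern, so its fields can be pulled out for logging or alerting. The following sketch works on plain message strings only. It does not assume anything about the objects that get_events() returns, and the names parse_train_update, is_terminal, and TERMINAL_EVENT_TYPES are hypothetical. Treating COMPLETED, CANCELED, and FAILED as terminal is an inference from the table, not a documented guarantee.

```python
import re
from typing import Dict

# Event types after which, per the table above, a run makes no further progress.
TERMINAL_EVENT_TYPES = {"COMPLETED", "CANCELED", "FAILED"}

# Matches bracketed fields such as [epoch=1/1] or [ETA=5min].
_FIELD = re.compile(r"\[(\w+)=([^\]]+)\]")
# Matches a plain decimal loss such as "Train loss: 1.71".
_LOSS = re.compile(r"Train loss:\s*([0-9]*\.?[0-9]+)")


def parse_train_update(message: str) -> Dict[str, object]:
    """Split a TRAIN_UPDATED message into its bracketed fields and train loss."""
    fields: Dict[str, object] = dict(_FIELD.findall(message))
    loss = _LOSS.search(message)
    # train_loss is None when the message carries no recognizable loss value.
    fields["train_loss"] = float(loss.group(1)) if loss else None
    return fields


def is_terminal(event_type: str) -> bool:
    """Return True for event types that end a run."""
    return event_type in TERMINAL_EVENT_TYPES
```

Because the bracketed fields are parsed generically, the same function should also pick up token-based progress fields, although the exact field name used in that case is not shown in this article.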

Use the UI to view and manage runs

To view runs in the UI:

  1. Click Experiments in the left nav bar to display the Experiments page.

  2. In the table, click the name of your experiment to display the experiment page. The experiment page lists all runs associated with the experiment.

    [Image: experiment page]

  3. To display additional information or metrics in the table, click the plus sign and select the items to display from the menu:

    [Image: add metrics to chart]

  4. Additional run information is available in the Chart tab:

    [Image: chart tab]

  5. You can also click on the name of the run to display the run screen. This screen gives you access to additional details about the run.

    [Image: run page]

Checkpoints

To access the checkpoint folder, click the Artifacts tab on the run screen. Open the folder named after the experiment, and then open the checkpoints folder. These artifact checkpoints are not the same as the registered model at the end of a training run.

[Image: checkpoint folder on artifacts tab]

There are a few directories in this folder:

  • The epoch folders (named ep<n>-xxx) contain the weights and model states at each Composer checkpoint. Composer checkpoints are saved periodically through training and are used for resuming a fine-tuning training run and for continued fine-tuning. This is the checkpoint you pass as custom_weights_path to start another training run from those weights. See Build on custom model weights.
  • In the huggingface folder, Hugging Face checkpoints are also saved periodically through training. After you download the content in this folder, you can load these checkpoints as you would any other Hugging Face checkpoint, using AutoModelForCausalLM.from_pretrained(<downloaded folder>).
  • The checkpoints/latest-sharded-rank0.symlink file holds the path to the latest checkpoint, which you can use to resume training.
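
After downloading the checkpoints folder, the layout described above can be inspected with a few lines of standard-library Python. This is a sketch under stated assumptions: the helper names epoch_checkpoint_dirs and read_latest_pointer are hypothetical, the folder is assumed to be available on the local file system, and latest-sharded-rank0.symlink is assumed to be a small text file containing a single path.

```python
import re
from pathlib import Path
from typing import List

# Epoch folders are named ep<n>-xxx; capture <n> so folders sort numerically.
_EPOCH_DIR = re.compile(r"^ep(\d+)-")


def epoch_checkpoint_dirs(checkpoints_dir: Path) -> List[Path]:
    """Return the ep<n>-... folders in a downloaded checkpoints folder, ordered by n."""
    found = []
    for child in checkpoints_dir.iterdir():
        match = _EPOCH_DIR.match(child.name)
        if child.is_dir() and match:
            found.append((int(match.group(1)), child))
    # Sort by epoch number first, then by name to keep the order deterministic.
    return [path for _, path in sorted(found, key=lambda item: (item[0], item[1].name))]


def read_latest_pointer(checkpoints_dir: Path) -> str:
    """Return the path stored in latest-sharded-rank0.symlink."""
    # Assumption: the pointer is a text file that contains one path.
    pointer = checkpoints_dir / "latest-sharded-rank0.symlink"
    return pointer.read_text(encoding="utf-8").strip()
```

Sorting on the numeric epoch avoids the usual pitfall of lexicographic ordering, where ep10 would otherwise sort before ep2.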