Prepare data for Mosaic AI Model Training

Important

This feature is in Public Preview in the following regions: centralus, eastus, eastus2, northcentralus, and westus.

This article describes the accepted training and evaluation data file formats for the Mosaic AI Model Training (formerly Foundation Model Training) tasks: supervised fine-tuning, chat completion, and continued pre-training.

The following notebook shows how to validate your data. Run it independently before you begin training to confirm that your data is in the correct format for Mosaic AI Model Training. It also includes code to tokenize your raw dataset, which helps you estimate the cost of your training run.

Validate data for training runs notebook


Prepare data for supervised fine-tuning

For supervised fine-tuning tasks, the training data can be in one of the following schemas:

  • Prompt and response pairs.

    {"prompt": "your-custom-prompt", "response": "your-custom-response"}
    
  • Prompt and completion pairs.

    {"prompt": "your-custom-prompt", "completion": "your-custom-response"}
    

Note

Prompt-response and prompt-completion data are not templated, so any model-specific templating, such as Mistral’s instruct formatting, must be performed as a preprocessing step.
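
As a rough illustration of such a preprocessing step, the following sketch wraps each raw prompt in a template before the data is written out. The template string is modeled on the instruction-style prompts in the mosaicml/dolly_hhrlhf example rows later in this article. The names INSTRUCT_TEMPLATE and apply_template are hypothetical and are not part of Mosaic AI Model Training; substitute the format that your model expects.

```python
# Hypothetical template, modeled on the instruction-style prompts in the
# mosaicml/dolly_hhrlhf example rows. Replace it with your model's own format.
INSTRUCT_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request. "
    "### Instruction: {instruction} ### Response: "
)


def apply_template(record, template=INSTRUCT_TEMPLATE):
    """Return a copy of a prompt/response record with the prompt templated."""
    templated = dict(record)
    templated["prompt"] = template.format(instruction=record["prompt"])
    return templated


raw = {"prompt": "what is Databricks?", "response": "A data platform."}
print(apply_template(raw)["prompt"])
```

Only the prompt is rewritten. The response (or completion) is left untouched, so the record still matches one of the accepted schemas.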

Accepted data formats are:

  • A Unity Catalog volume that contains a .jsonl file. The training data must be in JSONL format, where each line is a valid JSON object. The following example shows a prompt and response pair:

    {"prompt": "What is Databricks?","response": "Databricks is a cloud-based data engineering platform that provides a fast, easy, and collaborative way to process large-scale data."}
    
  • A Delta table that adheres to one of the accepted schemas mentioned above. For Delta tables, you must provide a data_prep_cluster_id parameter for data processing. See Configure a training run.

  • A public Hugging Face dataset.

    If you use a public Hugging Face dataset as your training data, specify the full path with the split, for example, mosaicml/instruct-v3/train and mosaicml/instruct-v3/test. This accounts for datasets that have different split schemas. Nested datasets from Hugging Face are not supported.

    For a more extensive example, see the mosaicml/dolly_hhrlhf dataset on Hugging Face.

    The following example rows of data are from the mosaicml/dolly_hhrlhf dataset.

    {"prompt": "Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: what is Databricks? ### Response: ","response": "Databricks is a cloud-based data engineering platform that provides a fast, easy, and collaborative way to process large-scale data."}
    {"prompt": "Below is an instruction that describes a task. Write a response that appropriately completes the request. ### Instruction: Van Halen famously banned what color M&Ms in their rider? ### Response: ","response": "Brown."}
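
Before you upload a .jsonl file, a quick local check of the supervised fine-tuning schemas described above can catch malformed lines. The following is a minimal sketch, not the validation notebook's logic. The function name validate_sft_jsonl is hypothetical, and treating blank lines and non-string values as problems is an assumption. The check does not reject extra keys, because this article does not say whether they are allowed.

```python
import json


def validate_sft_jsonl(lines):
    """Return (line_number, problem) tuples; an empty list means no problems found."""
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            # Assumption: a blank line is not a valid JSON object.
            problems.append((number, "blank line"))
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
        elif "prompt" not in record:
            problems.append((number, 'missing "prompt"'))
        elif "response" not in record and "completion" not in record:
            problems.append((number, 'missing "response" or "completion"'))
        elif not all(isinstance(value, str) for value in record.values()):
            # Assumption: prompt and response values are plain strings.
            problems.append((number, "values must be strings"))
    return problems


sample = [
    '{"prompt": "What is Databricks?", "response": "A data platform."}',
    '{"prompt": "Hi", "completion": "Hello"}',
    '{"prompt": "missing answer"}',
    "not json",
]
for number, problem in validate_sft_jsonl(sample):
    print(f"line {number}: {problem}")
```

To check a real file, pass an open file handle instead of the sample list, for example validate_sft_jsonl(open(path, encoding="utf-8")), where path is the location of your .jsonl file.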
    

Prepare data for chat completion

For chat completion tasks, chat-formatted data must be in a .jsonl file, where each line is a separate JSON object representing a single chat session. Each chat session is represented as a JSON object with a single key, "messages", that maps to an array of message objects. To train on chat data, set task_type = 'CHAT_COMPLETION'.

Messages in chat format are automatically formatted according to the model’s chat template, so there is no need to add special chat tokens to signal the beginning or end of a chat turn manually. An example of a model that uses a custom chat template is Mistral-instruct.

Note

Mistral models do not accept system roles in their data formats.

Each message object in the array represents a single message in the conversation and has the following structure:

  • role: A string indicating the author of the message. Possible values are "system", "user", and "assistant". A message with the role "system" is optional and, if present, must be the first message in the "messages" array. All messages after the system prompt must alternate between the "user" and "assistant" roles, so two adjacent messages never have the same role. There must be at least one message with the role "assistant", and the last message in the "messages" array must have the role "assistant".
  • content: A string containing the text of the message.

The following is a chat-formatted data example:

{"messages": [
  {"role": "system", "content": "A conversation between a user and a helpful assistant."},
  {"role": "user", "content": "Hi there. What's the capital of the moon?"},
  {"role": "assistant", "content": "This question doesn't make sense as nobody currently lives on the moon, meaning it would have no government or political institutions. Furthermore, international treaties prohibit any nation from asserting sovereignty over the moon and other celestial bodies."}
  ]
}
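
The role rules described above can be checked mechanically before training. The following is a minimal sketch under stated assumptions: check_chat_session is a hypothetical helper, not part of Mosaic AI Model Training, and it takes one already-parsed JSON object (one line of the .jsonl file). It does not apply model-specific restrictions such as the Mistral limitation on system roles.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}


def check_chat_session(session):
    """Return a list of rule violations for one parsed chat session."""
    if not isinstance(session, dict) or set(session) != {"messages"}:
        return ['session must be an object with the single key "messages"']
    messages = session["messages"]
    if not isinstance(messages, list) or not messages:
        return ['"messages" must be a non-empty array']

    errors = []
    roles = []
    for index, message in enumerate(messages):
        role = message.get("role") if isinstance(message, dict) else None
        roles.append(role)
        if not isinstance(role, str) or role not in VALID_ROLES:
            errors.append(f"message {index}: invalid or missing role")
            continue
        if not isinstance(message.get("content"), str):
            errors.append(f"message {index}: content must be a string")
        if role == "system" and index != 0:
            errors.append(f"message {index}: system must be the first message")
        if index > 0 and role == roles[index - 1]:
            errors.append(f"message {index}: same role as the previous message")
    if "assistant" not in roles:
        errors.append("no message with the role assistant")
    if roles[-1] != "assistant":
        errors.append("last message must have the role assistant")
    return errors


line = (
    '{"messages": ['
    '{"role": "system", "content": "Be helpful."}, '
    '{"role": "user", "content": "Hi"}, '
    '{"role": "assistant", "content": "Hello!"}]}'
)
print(check_chat_session(json.loads(line)))
```

Because json.loads rejects trailing commas, parsing each line this way also catches a common hand-editing mistake in the "messages" array.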

Prepare data for continued pre-training

For continued pre-training tasks, the training data is your unstructured text data. The training data must be in a Unity Catalog volume containing .txt files. Each .txt file is treated as a single sample. If your .txt files are in a folder within the Unity Catalog volume, those files are also included in your training data. Any files in the volume that are not .txt files are ignored. See Upload files to a Unity Catalog volume.
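
To preview which files would count as samples, you can list the .txt files under a directory. The following sketch runs against a temporary local directory that stands in for a volume. The helper name list_training_samples is hypothetical, and searching every nested folder recursively is an assumption about how folders are handled.

```python
import tempfile
from pathlib import Path


def list_training_samples(root):
    """Return every .txt file under root (subfolders included), sorted by path."""
    return sorted(p for p in Path(root).rglob("*.txt") if p.is_file())


# Demonstrate on a throwaway directory that stands in for a volume.
with tempfile.TemporaryDirectory() as tmp:
    volume = Path(tmp)
    (volume / "intro.txt").write_text("First sample.", encoding="utf-8")
    (volume / "notes.md").write_text("Ignored: not a .txt file.", encoding="utf-8")
    (volume / "chapters").mkdir()
    (volume / "chapters" / "one.txt").write_text("Second sample.", encoding="utf-8")

    samples = list_training_samples(volume)
    for path in samples:
        size = len(path.read_text(encoding="utf-8"))
        print(f"{path.relative_to(volume)}: {size} characters")
```

Each path printed corresponds to one training sample, and notes.md is skipped because it is not a .txt file.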

The following image shows example .txt files in a Unity Catalog volume. To use this data in your continued pre-training run configuration, set train_data_path = "dbfs:/Volumes/main/finetuning/cpt-data".

UC Volume with continued pre-training dataset file examples