What is supposed to be included in the "system" section of the Azure OpenAI fine-tuning format?

Mason 25

My RAG model uses the following inputs:

Prompt
Context
User question
Chat History

The following link has the documentation on the finetuning: https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/fine-tuning?tabs=azure-openai%2Cturbo%2Cpython-new&pivots=programming-language-studio#multi-turn-chat-file-format-azure-openai. What does the 'system' section of the JSONL file need? Just the prompt and the context?

navba-MSFT 27,550 Reputation points Microsoft Employee Moderator

2024-12-02T04:59:43.4366667+00:00

@Mason Welcome to the Microsoft Q&A forum, and thank you for posting your query here!

Just following up to check whether you had a chance to look at the answer below. If it helps, please do not forget to "Accept the answer" and "up-vote" wherever the information provided helps you; this can be beneficial to other community members.

Answer 1

Marcin Policht 50,895 MVP Volunteer Moderator

In the fine-tuning JSONL file format for Azure OpenAI's multi-turn chat models, the system section represents the instructions or setup provided to the model before it interacts with the user. The system section typically includes:

Prompt: Instructions about the role the model should take or the task it should perform (e.g., "You are a helpful assistant.").
Context: Background information relevant to the conversation to help the model understand the user's question better and provide accurate responses.

In your RAG model, since you're working with Prompt and Context specifically, these elements should be encapsulated in the system section. Here's an example structure:

{
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant. Here is some context to assist you: {Insert Context}"
        },
        {
            "role": "user",
            "content": "User question goes here"
        },
        {
            "role": "assistant",
            "content": "Model's response goes here"
        }
    ]
}

Include Prompt and Context as part of the system role to define the model's behavior and provide necessary background.
The system role does not need the user question or chat history; these belong to the user and assistant roles in subsequent turns.
If you want chat history for continuity, include it explicitly in the user or assistant roles in the conversation.
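
As a rough sketch of the structure described above, the following Python assembles one training example from a prompt, retrieved context, earlier turns, and the current question, then serializes it as a single JSONL line. The helper name `build_example`, its parameters, and the sample strings are illustrative assumptions, not part of any Azure SDK. Since JSONL requires one JSON object per line, the pretty-printed layout shown earlier would be collapsed onto one line in the actual file.

```python
import json


def build_example(prompt, context, history, question, answer):
    """Assemble one chat fine-tuning example (hypothetical helper).

    history is a list of (user_text, assistant_text) pairs from earlier turns.
    """
    # Prompt and context are combined in the system message.
    messages = [
        {
            "role": "system",
            "content": f"{prompt} Here is some context to assist you: {context}",
        }
    ]
    # Earlier turns are replayed under their own roles, not in the system message.
    for user_text, assistant_text in history:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    # The current question and the desired answer close the example.
    messages.append({"role": "user", "content": question})
    messages.append({"role": "assistant", "content": answer})
    return {"messages": messages}


example = build_example(
    prompt="You are a helpful assistant.",
    context="(retrieved passages would go here)",
    history=[("Earlier question", "Earlier answer")],
    question="User question goes here",
    answer="Model's response goes here",
)

# JSONL: one JSON object per line, so no indentation is used.
line = json.dumps(example, ensure_ascii=False)
print(line)
```

Each training example would be written to the file as one such line.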

If the above response helps answer your question, remember to "Accept Answer" so that others in the community facing similar issues can easily find the solution. Your contribution is highly appreciated.

hth

Marcin

Mason 25 Reputation points

2024-12-02T13:33:10.1633333+00:00

This is the start of my prompt in azure prompt flow:

Which one of the following should I use for the fine tuning format (used the short prompt below for demonstration purposes)?

Option 1:
{ "role": "system", "content": "system:\n*You are a helpful assistant. Here is some context to assist you: {Insert Context}" },

The other option is to leave 'system' out. Option 2:
{ "role": "system", "content": "\n*You are a helpful assistant. Here is some context to assist you: {Insert Context}" },

Option 3:
{ "role": "system", "content": "*You are a helpful assistant. Here is some context to assist you: {Insert Context}" }

Option 4:
{ "role": "system", "content": "You are a helpful assistant. Here is some context to assist you: {Insert Context}" },

Option 5:
{ "role": "system", "content": "\nYou are a helpful assistant. Here is some context to assist you: {Insert Context}" },
Mason 25 Reputation points

2024-12-02T13:36:43.3366667+00:00

Thank you.
Marcin Policht 50,895 Reputation points MVP Volunteer Moderator

2024-12-02T14:22:19.87+00:00
The optimal choice depends on how strictly you want to adhere to formatting conventions and whether the model you're fine-tuning requires explicit labels like "system:" or special formatting like newlines. However, in most cases for Azure OpenAI fine-tuning with multi-turn chat models, Option 4 is likely the cleanest and most effective format:

No Redundant Tags: There's no need to prefix the content with "system:" or additional formatting like asterisks (*). The "role": "system" already provides this context.

Minimal Noise: Keeping the content simple ensures the model focuses on the instruction rather than extraneous formatting or symbols.

Compatibility: Azure OpenAI fine-tuning typically expects plain text content within the "content" field, and unnecessary symbols might inadvertently confuse the model.

Clarity: A clear and straightforward instruction is sufficient to define the assistant's role.

hth

Marcin
Mason 25 Reputation points

2024-12-02T15:48:48.02+00:00

I guess my goal is to finetune the model with the exact prompt/chathistory/chatinput format that it is getting now. Trying to do this because I have literally changed punctuation before and it changes how it answers questions inputted into it.

Right now it gets this:

system:
*You ...

I would like all that information to be captured in the finetuning. So I guess my question is how does Microsoft use the JSONL format in the fine tuning? I assumed it went from (for example)

{ "role": "system", "content": "\n*You..." }

to

"system:\n*You ..."

Which is what I would want. But I do not know if it moves 'role' to the front of the 'content' and adds a colon before it is used to train or not. If it does not, I think I should add a 'system' before the prompt in the 'content section'. It is not in any documentation that I have found.
Marcin Policht 50,895 Reputation points MVP Volunteer Moderator

2024-12-02T16:54:32.4766667+00:00
Your question addresses a subtle consideration in fine-tuning Azure OpenAI models: how the input format is interpreted during training and whether the role field in the JSONL file affects the training data's content directly. To clarify:

The role field in the JSONL fine-tuning format (e.g., system, user, assistant) is metadata that tells the model how to interpret the corresponding content field. It does not prepend the role (like "system:") to the content field automatically.

During training, the content text is used exactly as written; nothing is added to it. The role information tells the training process which speaker the text belongs to, so the model can associate behaviors with each type of role.

Microsoft Azure OpenAI models use the JSONL format as a guideline for structuring conversation history and context.

The role field is metadata and does not inherently modify the content field during training. For instance:
{ "role": "system", "content": "\n*You..." }
Would train the model on:
\n*You...
without any additional system: prepended.
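
One detail that is easy to check locally is what the `\n` escape in the JSONL file turns into once the line is parsed. The short Python sketch below only demonstrates standard JSON decoding; it makes no claim about how the service formats messages internally. The sample line is the one from this thread.

```python
import json

# The message exactly as it would be written in the JSONL file.
# A raw string is used so that \n stays as two characters, as on disk.
raw_line = r'{"role": "system", "content": "\n*You..."}'

message = json.loads(raw_line)
content = message["content"]

# After parsing, the escape is a single newline character,
# and the role name is not part of the content text.
print(repr(content))
print(content.startswith("\n*"))
print("system:" in content)
```

So the decoded content begins with a real line break followed by the asterisk, and any `system:` prefix would only be present if it were typed into the content itself.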

If the way the model behaves is sensitive to punctuation or formatting, you must explicitly include such details in the content field. For example:

If the model's current format expects system: at the beginning of the prompt:
{ "role": "system", "content": "system:\n*You are a helpful assistant. Here is some context: {Insert Context}" }
This ensures the exact behavior you're expecting is preserved during fine-tuning.

Fine-tuning data should exactly replicate the format the model sees during inference to ensure consistent behavior. If your system today uses system:\n*You... as part of its context, that formatting must be included directly in the content.

Given your use case, where the format impacts the response:

Explicitly include system: or any other prefixes you use today in the content field. For example:
{ "role": "system", "content": "system:\n*You are a helpful assistant. Here is some context to assist you: {Insert Context}" }

Ensure that all punctuation, spacing, and line breaks match the format the model receives in production.

To validate this approach:

Use the content field directly in the Playground or API with the exact desired format (e.g., system:\n*You...).

Ensure the model behaves as expected before proceeding with fine-tuning.
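
To check that the training data really matches the production prompt character for character, a small comparison along these lines may help. It is only a sketch: `first_difference` is a hypothetical helper, and the two prompt strings are shortened stand-ins for the real prompt.

```python
import json


def first_difference(expected, actual):
    """Return the index of the first differing character, or None if identical."""
    for index, (a, b) in enumerate(zip(expected, actual)):
        if a != b:
            return index
    # One string is a prefix of the other: they differ where the shorter ends.
    if len(expected) != len(actual):
        return min(len(expected), len(actual))
    return None


# Stand-in for the text the model receives in production.
production_prompt = "system:\n*You are a helpful assistant."

# Round-trip the training message through JSON, as it would be stored in the file.
training_line = json.dumps({"role": "system", "content": production_prompt})
training_content = json.loads(training_line)["content"]

# None means the stored content matches the production prompt exactly.
mismatch = first_difference(production_prompt, training_content)

# A variant that lost the asterisk is flagged at the position of the change.
drifted = first_difference(production_prompt, "system:\nYou are a helpful assistant.")

print(mismatch, drifted)
```

Running such a check over every line of the training file would catch accidental changes in punctuation, spacing, or line breaks before a fine-tuning job is submitted.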

hth

Marcin

Mason 25

How would that work for the 'user' and 'assistant' sections? The format is the following if I have 2 back and forths with the bot:

{ "role": "system", "content": "You are a helpful assistant. Here is some context to assist you: {Insert Context}" }, { "role": "user", "content": "User question goes here" }, { "role": "assistant", "content": "Model's response goes here", "weight":0 }, { "role": "user", "content": "User question goes here" }, { "role": "assistant", "content": "Model's response goes here", "weight":1 }

Do I have to put 'user: ' and 'assistant: ' before each of the respective bot responses and inputs to match the format? This is the end of my prompt in azure prompt flow:

User's image

navba-MSFT 27,550 Reputation points Microsoft Employee Moderator

2024-12-04T04:00:46.8+00:00
@Mason For fine-tuning your Azure OpenAI model, the "system" section in the JSONL file should include the initial instructions or context that you want the model to follow throughout the conversation. This section sets the stage for the assistant's behavior and tone.

Based on the documentation and your goal to maintain the exact format, here are some clarifications and recommendations:

System Section: The "system" role should contain the initial instructions. It doesn't need to include the word "system" within the content itself. The role already indicates that it's a system message.

Formatting Options: Among the options you provided, Option 4 is the most appropriate. It clearly defines the role and content without unnecessary characters or formatting that might confuse the model.

{ "role": "system", "content": "You are a helpful assistant. Here is some context to assist you: {Insert Context}" }

Role and Content: The "role" field should not be included in the "content" field. The JSONL format uses the "role" to specify the type of message (system, user, assistant), and the "content" field to specify the actual message.

User and Assistant Sections: You do not need to prepend "user: " or "assistant: " within the content. The role field already specifies who is speaking. Here’s how a multi-turn conversation should look:

[ { "role": "system", "content": "You are a helpful assistant. Here is some context to assist you: {Insert Context}" }, { "role": "user", "content": "User question goes here" }, { "role": "assistant", "content": "Model's response goes here", "weight": 0 }, { "role": "user", "content": "User question goes here" }, { "role": "assistant", "content": "Model's response goes here", "weight": 1 } ]

Weight Field: The "weight" field is optional and applies to assistant messages. It controls whether a specific response is used during fine-tuning: a weight of 1 means the response is trained on, while 0 means it is skipped during training but still remains in the conversation as context.
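
A possible way to generate such multi-turn examples is sketched below in Python. It assumes that "weight" takes only the values 0 or 1 and appears only on assistant messages; `weighted_conversation` and `check_weights` are hypothetical helpers, and the sample strings are placeholders.

```python
import json


def weighted_conversation(system_content, turns):
    """Build a multi-turn example where only the final assistant reply has weight 1.

    turns is a list of (user_text, assistant_text) pairs in conversation order.
    """
    messages = [{"role": "system", "content": system_content}]
    last = len(turns) - 1
    for index, (user_text, assistant_text) in enumerate(turns):
        # No "user: " or "assistant: " prefix is added; the role field carries that.
        messages.append({"role": "user", "content": user_text})
        messages.append(
            {
                "role": "assistant",
                "content": assistant_text,
                "weight": 1 if index == last else 0,
            }
        )
    return {"messages": messages}


def check_weights(example):
    """Raise ValueError if a weight is misplaced or is not 0 or 1 (assumed rules)."""
    for message in example["messages"]:
        if "weight" not in message:
            continue
        if message["role"] != "assistant":
            raise ValueError("weight is only expected on assistant messages")
        if message["weight"] not in (0, 1):
            raise ValueError("weight is expected to be 0 or 1")


example = weighted_conversation(
    "You are a helpful assistant. Here is some context to assist you: (context)",
    [("First question", "First answer"), ("Second question", "Second answer")],
)
check_weights(example)
print(json.dumps(example))
```

Marking only the last reply with weight 1 mirrors the example in this thread; other weightings are possible if earlier replies should also be learned.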
Mason 25 Reputation points

2024-12-04T13:22:43.11+00:00

Thanks for your patience with this. I do want to note that this is the beginning of what is sent to the bot. It seems from your last comment that 'system:' will automatically be applied, so I don't have to add it to the 'content' section. Shouldn't '\n*' be included if I am trying to keep the same format?

I think the mistake might be me adding the \n* to the prompt in the first place. But, if it is left in, should it be in the fine tuning?
Marcin Policht 50,895 Reputation points MVP Volunteer Moderator

2024-12-04T17:58:02.47+00:00
Your questions are fully justified. Here is what you should consider:

Does system: Automatically Apply?

The role: system metadata does NOT automatically prepend system: to the content field during fine-tuning or inference. If your input text starts with system:, it needs to be explicitly included in the content.

The role helps Azure OpenAI models interpret the context (e.g., instructions from the system) but does not alter the content format.

Should \n* Be Included in Fine-Tuning?

If \n* is part of the format the model sees today in production (even if it was added mistakenly), it should be included in the fine-tuning data to ensure consistency. Fine-tuning relies heavily on the exact input-output format provided in the training data.

Effectively:

Include it if:

The current model is trained or expected to process inputs with \n* and generates desirable responses in that format.

You want the fine-tuned model to mimic behavior from past interactions precisely.

Omit it if:

\n* was added accidentally and is unnecessary for the model’s interpretation or output.

Removing it doesn’t negatively impact how the model answers questions.

If \n* is left in:

The model will learn to treat it as part of the input text structure.

It will likely influence how the model generates responses, especially if punctuation, line breaks, or symbols affect its interpretation.

Given your goal to maintain the current behavior:

Verify Current Behavior:

Test prompts with and without \n* in the Playground or API to see if it impacts the responses.

If it changes behavior, keep \n* in the fine-tuning data.

Include \n* If Keeping Consistency:
{ "role": "system", "content": "system:\n*You are a helpful assistant. Here is some context to assist you: {Insert Context}" }

Update Future Prompts:

If \n* isn’t needed, consider reformatting future inputs and removing it from fine-tuning data to simplify and normalize interactions.

hth Marcin
