Note
Access to this page requires authorization. You can try signing in or changing directories.
Access to this page requires authorization. You can try changing directories.
Reinforcement fine-tuning (RFT) is a technique for improving reasoning models by training them through a reward-based process, rather than relying only on labeled data. RFT helps models develop better reasoning and problem-solving skills, especially in cases where labeled examples are limited or complex behaviors are desired.
Note
The fine-tuning service automatically pauses RFT jobs once they hit $5,000 in total training costs (training + grading). You can deploy the most recent checkpoint or resume the training job. If you decide to resume the job, billing continues for the job with no further cost-based limits.
Model support
Reinforcement fine-tuning is supported for the following models:
| Model | Version | RFT support | Status |
|---|---|---|---|
o4-mini |
2025-04-16 |
Yes | GA |
gpt-5 |
2025-08-07 |
Yes | Private preview |
Note
GPT-5 support for reinforcement fine-tuning is in private preview and might not be available in your subscription.
Requirements
Reinforcement fine-tuning (RFT) requires training and validation data formatted as JSONL and containing a messages array that uses the chat completions format.
However, RFT has more requirements:
- Data
- Assign the final "message" in the data a
userrole. - Include extra fields and values for use by a grader.
- Provide both a training and a validation dataset.
- Assign the final "message" in the data a
- Graders
- Define a grader to score the quality of your fine-tuned model and guide learning.
- Provide only a single grader, but you can combine multiple graders by using a multigrader.
Example training data
The following example shows how to present prompts to the model and include ground truth that a grader can access.
{
"messages": [
{
"role": "developer",
"content": "Your task is to solve logic puzzles. The user will provide an expression with ?'s as placeholders for arithmetic operations. Replace the ?'s with arithmetic operation signs (+, -, *, /) to obtain a valid equation."
},
{
"role": "user",
"content": "1 ? 2 ? 3 ? 4 ? 5 ? 6 ? 7 ? 8 ? 9 = 100"
}
],
"solution": "1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 * 9 = 100"
}
Note
This example is split across multiple lines for demonstration purposes only. It must be a single line in your JSONL file.
Graders
Graders provide the reward function used during training and have access to any user-supplied fields in the dataset. Multiple types of graders are available:
- text comparison: score response content based on its text
- model: score responses using a language model and prompt
- custom code: score responses using custom code
- multigrader: score based on a combination of scores from other graders
Most graders perform substitution of runtime data via templates. Any input or reference properties can include variable substitution enclosed in double curly braces ({{ }}) containing a reference to a variable.
Each template reference must be namespaced using a pattern like {{ namespace.variable }}. For any complex, nested data, a JSON-path like syntax is supported.
The following namespaces are supported:
sample- model output to be graded appears under thesamplenamespace in a format similar to a chat completions response.item- optional, extra fields provided in training data appear under theitemnamespace.
Some examples of template substitution that use the above namespaces:
{{ sample.output_text }}- substitute the model output as a string{{ sample.output_json }}- if the model produced structured outputs, reference it as JSON{{ item.answer }}- substitute the "answer" field in the dataset{{ item.ground_truth.date }}- substitute the "date" field of a "ground_truth" object defined in the dataset
The following sections document individual graders and provide their JSON specification for defining via the API.
Text comparison graders
Use text comparison graders when the use case requires the model output to be either a definitive label or if the output must resemble a known ground truth answer.
String-check-grader
String-check graders apply a given operation to the input and a reference to return a 0 or 1, providing a simple pass/fail function.
{
"type": "string_check",
"name": string,
"operation": "eq" | "ne" | "like" | "ilike",
"input": string,
"reference": string,
}
Operations:
| Operation | Returns 1 when | Case-sensitive |
|---|---|---|
eq |
Input matches reference | Yes |
ne |
Input doesn't match reference | Yes |
like |
Input contains reference | Yes |
ilike |
Input contains reference | No |
Text similarity
Text-similarity graders compute a score based on a select algorithm for quantifying similarity between the input text and a given reference text.
Specification:
{
"type": "text_similarity",
"name": string,
"input": string,
"reference": string,
"pass_threshold": number,
"evaluation_metric": "fuzzy_match" | "bleu" | "gleu" | "meteor" | "rouge_1" | "rouge_2" | "rouge_3" | "rouge_4" | "rouge_5" | "rouge_l"
}
Evaluation metrics:
fuzzy_match– fuzzy string match, using the RapidFuzz algorithmbleu– computes BLEU (bilingual evaluation understudy) score between stringsgleu– computes Google BLEU score between stringsmeteor– computes METEOR score between stringsrouge-*- as defined by the rouge python library
Model graders
Model graders take a prompt to a grader model that instructs it how to evaluate and score a given response. This flexibility allows for prompt engineering complex graders that support explaining the reason for a given score.
Use the following models as model graders:
| Model | Can be used as grader |
|---|---|
gpt-4o-2024-08-06 |
Yes |
o3-mini-2025-01-31 |
Yes |
Note
Model graders don't require model deployments in Foundry.
Score model
Score model graders output a numeric score based on their given input and prompt. Any provided sampling_params control the behavior of the scoring model and allow for customizing things like temperature and reasoning effort.
{
"type": "score_model",
"name": string,
"input": Message[],
"model": string,
"pass_threshold": number,
"range": number[],
"sampling_params": object
}
Code graders
Model graders are flexible but nondeterministic. When you need deterministic scoring, use code graders instead.
Python grader
The Python grader executes arbitrary Python code to produce a score.
The provided code must define a grade function that takes two positional arguments: sample and item. The function must return a numeric score.
{
"type": "python",
"name": string,
"source": "def grade(sample, item):\n return 1.0"
}
The Python code runs in a constrained environment with the following limitations:
| Resource | Limit |
|---|---|
| Code size | 256 KB |
| Network | No access |
| Memory | 2 GB |
| Disk space | 1 GB |
| CPU | 1 core |
| Runtime | 2 minutes |
Tip
Your code should handle any possible errors and always return a numeric value. If too many exceptions occur during the execution of the grader, the training job fails.
Within the Python runtime, the provided code can use the following modules and versions:
- numpy==2.2.4
- scipy==1.15.2
- sympy==1.13.3
- pandas==2.2.3
- rapidfuzz==3.10.1
- scikit-learn==1.6.1
- rouge-score==0.1.2
- deepdiff==8.4.2
- jsonschema==4.23.0
- pydantic==2.10.6
- pyyaml==6.0.2
- nltk==3.9.1
- sqlparse==0.5.3
- rdkit==2024.9.6
- scikit-bio==0.6.3
- ast-grep-py==0.36.2
Endpoint grader (preview)
Endpoint graders call a remote endpoint through an HTTP API to score the model response. They're perfect for use cases that require access to ground truth for accurate scoring or the ability to implement the grader in a language other than Python.
While in private preview, the API for endpoint graders isn't published.
Multigrader
A multigrader combines the output of multiple graders to produce a single score based on an arithmetic expression provided in calculate_output.
{
"type": "multi",
"name": string,
"graders": dict[str, Grader],
"calculate_output": string
}
When a multigrader computes the score, the calculate_output expression references the individual scores from the provided graders by the key in the graders object.
Operators:
| Operator | Description |
|---|---|
+ |
Addition |
- |
Subtraction |
* |
Multiplication |
/ |
Division |
^ |
Power |
Functions:
| Function | Description |
|---|---|
min |
Compute the minimum of a value |
max |
Compute the maximum of a value |
abs |
Compute the absolute value |
floor |
Round the value down |
ceil |
Round the value up |
exp |
Compute e to the power of the provided value |
sqrt |
Take the square root of the value |
log |
Compute the logarithm of the provided value |
As an example, a multigrader defined with two graders, "similarity-score" and "label-checker," that must average their outputs could look like:
{
"type": "multi",
"name": "Example multigrader",
"graders": {
"similarity_score": {
"type": "text_similarity",
"name": "similarity grader",
"input": "{{ sample.output_text }}",
"reference": "{{ item.summary }}",
"evaluation_metric": "bleu"
},
"label_checker": {
"type": "string_check",
"name": "label grader",
"input": "{{ sample.output_text }}",
"reference": "{{ item.label }}",
"operation": "eq"
}
},
"calculate_output": "(similarity_score + label_checker) / 2"
}
Response format (optional)
During training, you can configure the model to produce structured outputs. This structure can align with the intended use case of the model or make grading the output easier.
The response format configuration follows the same specification as Chat Completions, either supporting text (the default) or JSON. When the model should output JSON, you must provide a JSON Schema.
To continue with the previous example, if the model must output the response in a structured format such as:
{ "solution": "1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 * 9 = 100" }
The following JSON schema describes the response format:
{
"type": "json_schema",
"name": "puzzles_assistant",
"schema": {
"type" : "object",
"properties": {
"solution": {
"type": "string",
"title": "solution"
}
},
"required": [
"solution",
],
"additionalProperties": false
},
"strict": true
}
Hyperparameter selection
Reinforcement fine-tuning supports the same hyperparameters as supervised fine-tuning. Additionally, the following hyperparameters control features specific to RFT:
| Hyperparameter name | Value | Default | Description |
|---|---|---|---|
eval_interval |
integer | auto |
The number of training steps between evaluation runs. |
eval_samples |
integer | auto |
The number of samples to use during evaluation. |
compute_multiplier |
number | auto |
The multiplier on amount of compute use for exploring space during training. |
reasoning_effort |
low, medium, high |
medium |
The reasoning effort used by the model during training. |
Note
The training service automatically replaces hyperparameters set to auto with defaults based on heuristics on the provided training data.
Interpreting training results
Reinforcement fine-tuning provides both automatic evaluations of the model during training and real-time training metrics.
Training metrics
When you monitor a running job or inspect a completed job, the reward and reasoning metrics provide an indicator of training success.
Reward
Reward metrics track the resulting scores from the grader acting as the reward function.
train_reward_mean: the average reward across the batch of training data at a given step. Because each batch might be different across steps, the trend of this metric is more important than comparing values across steps.valid_reward_mean: the average reward across the samples taken from the validation set at a given step.
Reward metrics should generally increase over the course of the training job. If they diverge significantly, it's a sign the model might be reward hacking and the grader requires more engineering.
Reasoning tokens
Each training job tracks the number of reasoning tokens produced by the model. Reasoning token metrics capture how the model changes its behavior over the lifetime of the training job.
train_reasoning_tokens_mean: the average number of reasoning tokens produced across the batch of training data at a given step.valid_reasoning_tokens_mean: the average number of reasoning tokens produced across the validation data at a given step.
The model might learn to use fewer reasoning tokens to achieve the same reward, or it might learn to use more reasoning tokens to achieve a higher reward. These metrics typically rise and fall during the training job.
Automatic evaluations
The system automatically creates an evaluation for each RFT job. At regular intervals defined by the eval_interval hyperparameter, the training system executes an evaluation run by using the validation data. You can view scores for each run through the linked evaluation, which you can discover from the Foundry user interface.
Inspecting these evaluations provides an extra data point for deciding on early stopping. If the model exhibits learning during training, the results of each evaluation run should improve over the lifetime of the job.
Example projects and datasets
The following example demos and datasets provide starting points for new users of reinforcement fine-tuning:
- Countdown Demo - end-to-end demo of using RFT to improve mathematical reasoning.
- MedMCQ - sample dataset and graders for answering multiple-choice questions from the medical domain.
- ClauseMatching - sample dataset and graders showcasing both summarization and content interpretation in the legal domain.