
Commit 9a26c08

committed: acrolinx
1 parent 33a5e46 commit 9a26c08

File tree

1 file changed

articles/ai-services/openai/how-to/reinforcement-fine-tuning.md

Lines changed: 34 additions & 34 deletions
@@ -12,13 +12,13 @@ ms.author: mbullwin

# Reinforcement fine-tuning (RFT) with Azure OpenAI o4-mini

Reinforcement fine-tuning (RFT) is a technique for improving reasoning models like o4-mini by training them through a reward-based process, rather than relying only on labeled data. By using feedback or "rewards" to guide learning, RFT helps models develop better reasoning and problem-solving skills, especially in cases where labeled examples are limited or complex behaviors are desired.

## Process

The process of reinforcement fine-tuning (RFT) is similar to supervised fine-tuning (SFT), with some notable differences:

- **Data preparation:** system messages aren't supported, and instead of an `assistant` message, the final message in your training data is a reference answer.
- **Model selection:** only o4-mini supports RFT.
- **Grader definition:** RFT requires the use of **graders** to score the quality of your fine-tuned model and guide learning. You can use string check, text similarity, or model-based graders – or combine them with a multi-grader.
- **Training:** includes additional parameters: `eval_samples`, `eval_interval`, `reasoning_effort`, and `compute_multiplier`, as shown in the sketch after this list. You also have the option to pause and resume jobs, allowing you to pause training, inspect checkpoints, and only continue if further training is needed.
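
For orientation, here's a hedged sketch of how these training parameters might be grouped under a `reinforcement` method when creating a job. The `method`/`reinforcement` nesting, the placeholder grader, and the specific values are illustrative assumptions; only the parameter names (`eval_samples`, `eval_interval`, `reasoning_effort`, `compute_multiplier`) come from the description above.

```json
{
  "method": {
    "type": "reinforcement",
    "reinforcement": {
      "grader": {
        "type": "string_check",
        "name": "example_grader",
        "operation": "eq",
        "input": "{{sample.output_text}}",
        "reference": "{{item.reference_answer.answer}}"
      },
      "hyperparameters": {
        "eval_samples": 1,
        "eval_interval": 5,
        "reasoning_effort": "medium",
        "compute_multiplier": 1.0
      }
    }
  }
}
```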
@@ -30,19 +30,19 @@ However, despite these differences, there are many commonalities between SFT and

## Training & evaluation file format requirements

Both training and validation files are required to run o4-mini RFT. o4-mini uses a new data format for reinforcement fine-tuning. These should be JSONL files, like those used for supervised fine-tuning (SFT).

Each line of the file should contain the `messages` field, with some differences from SFT:

- System messages aren't supported
- The final message must be from the user, not the assistant (as is the case for SFT)
- `Tools`, `functions`, and `response_formats` are supported
- Images / multimodal data aren't supported

Each line must include a new field called `reference_answer`:

- `reference_answer` contains the data used by your grader to determine the correctness of the answer.
- This value must be a valid JSON object (for example, a dictionary or list; the specific type and structure depend on your selected grader).

We currently support a maximum of 50,000 reference examples for training.
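
To make the format concrete, here's a minimal sketch of a single training line (each line of the JSONL file is one such object). The prompt wording and the shape of the reference answer are hypothetical; the exact structure of `reference_answer` depends on the grader you choose.

```json
{"messages": [{"role": "user", "content": "Name the capital of France. Respond as JSON: {\"answer\": \"<city>\"}"}], "reference_answer": {"answer": "Paris"}}
```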

@@ -103,7 +103,7 @@ The hyperparameters section of the **reinforcement** method supports all of the

## Graders

RFT is unique because it uses graders to assess the quality of a model’s response to teach the model to reason. Unlike SFT, the final message isn't from the assistant – instead, we sample the model and use a grader on each sample to score its quality. We then train based on those scores to improve model performance.

Effectively, graders are functions that compare the `reference_answer` from your training file with the sampled response.

@@ -115,11 +115,11 @@ Effectively, graders are functions that compare the reference_answer from your t

### Supported graders

We support three types of graders: string check, text similarity, and model graders. There's also a multi-grader option that lets you use graders in combination.

### String-check-grader

Use these basic string operations to return a `0` or `1`.

**Specification:**

@@ -135,14 +135,14 @@ Use these simple string operations to return a 0 or 1.

**Supported operations:**

- `eq`: Returns 1 if the input matches the reference (case-sensitive), 0 otherwise
- `neq`: Returns 1 if the input doesn't match the reference (case-sensitive), 0 otherwise
- `like`: Returns 1 if the input contains the reference (case-sensitive), 0 otherwise
- `ilike`: Returns 1 if the input contains the reference (not case-sensitive), 0 otherwise
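
As an illustration, a string-check grader that awards 1 when the sampled answer contains the reference (ignoring case) might look like the sketch below. Treat the field names and the `{{item.reference_answer.answer}}` subfield as assumptions for this example, and confirm them against the specification above.

```json
{
  "type": "string_check",
  "name": "contains_reference",
  "operation": "ilike",
  "input": "{{sample.output_text}}",
  "reference": "{{item.reference_answer.answer}}"
}
```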

### Text similarity

Evaluates how close the model-generated output is to the reference, scored with various evaluation metrics.

**Specification:**

@@ -169,7 +169,7 @@ To evaluate how close the model-generated output is to the reference,scored with

This is the model grader, where you can use an LLM to grade the training output.

The models we support as grader models are:

- `gpt-4o-2024-08-06`
- `o3-mini-2025-01-31`
@@ -192,7 +192,7 @@ Models which we are supporting as grader models are:
}
```

To use a score model grader, the input is a list of chat messages, each containing a role and content. The output of the grader is truncated to the given range and defaults to 0 for all non-numeric outputs.
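
For illustration, a score model grader using one of the supported grader models might look like the following sketch. The grader name, prompt text, and `range` values are assumptions for this example; the `{{item.reference_answer}}` and `{{sample.output_text}}` template variables are the ones used elsewhere in this article.

```json
{
  "type": "score_model",
  "name": "reference_match_scorer",
  "model": "gpt-4o-2024-08-06",
  "input": [
    {
      "role": "user",
      "content": "Score 1.0 if the model answer matches the reference answer and 0.0 otherwise. Return just a floating point score.\n\nReference answer: {{item.reference_answer}}\n\nModel answer: {{sample.output_text}}"
    }
  ],
  "range": [0, 1]
}
```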

### Multi Grader

@@ -226,14 +226,14 @@ A multigrader object combines the output of multiple graders to produce a single
- `sqrt`
- `log`
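
A multi-grader combines sub-graders with a formula over their scores. The sketch below assumes a `multi` grader type with `graders` and `calculate_output` fields and uses hypothetical sub-grader names; treat the exact structure as an assumption and validate it before use.

```json
{
  "type": "multi",
  "graders": {
    "exact": {
      "type": "string_check",
      "name": "exact",
      "operation": "eq",
      "input": "{{sample.output_text}}",
      "reference": "{{item.reference_answer.answer}}"
    },
    "similarity": {
      "type": "text_similarity",
      "name": "similarity",
      "evaluation_metric": "fuzzy_match",
      "input": "{{sample.output_text}}",
      "reference": "{{item.reference_answer.answer}}"
    }
  },
  "calculate_output": "0.5 * exact + 0.5 * similarity"
}
```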

When using the UX, you can write a prompt and generate a valid grader and response format in JSON as needed.

> [!IMPORTANT]
> Generating a correct grader schema requires careful prompt authoring. You might find that your first few attempts generate invalid schemas or don't create a schema that properly handles your training data. The grader is a mandatory field that must be entered when submitting a fine-tuning job. The response format is optional.

:::image type="content" source="../media/how-to/reinforcement-fine-tuning/grader-schema.png" alt-text="Screenshot of the reinforcement fine-tuning grader schema generation experience." lightbox="../media/how-to/reinforcement-fine-tuning/grader-schema.png":::

Here's an example grader for each category:

**string-check-grader** - Use simple string operations to return a 0 or 1.

@@ -249,7 +249,7 @@ Here is an example grader for each category:
}
```

**Text similarity** - Evaluate how close the model-generated output is to the reference, scored with various evaluation metrics.

```json
{
@@ -262,7 +262,7 @@ Here is an example grader for each category:

**Score Model** - Use an LLM to grade the training output.

The supported grader models are `gpt-4o-2024-08-06` and `o3-mini-2025-01-31`.

```json
0.5 if one is the same, and 0.0 if neither are the same. Return just a floating point score\n\n Reference answer: {\u0022donors\u0022: {{item.reference_answer.donors}}, \u0022acceptors\u0022: {{item.reference_answer.acceptors}}}\n\n Model answer: {{sample.output_text}}"
@@ -303,18 +303,18 @@ Models which we are supporting as grader models are `gpt-4o-2024-08-06`and `o3-m

## Training progress and results

RFT jobs are typically long running and may take up to 24 hours depending on your parameter selection. You can track progress in both fine-tuning views of the AI Foundry portal. You'll see your job go through the same statuses as normal fine-tuning jobs (queued, running, succeeded).

You can also review the results files while training runs, to get a peek at the progress and check whether your training is proceeding as expected.

**New feature: pause and resume**

During training, you can view the logs and RFT metrics and pause the job as needed (for example, if metrics aren't converging, the model isn't learning at the right pace, or you chose an incorrect grader).

:::image type="content" source="../media/how-to/reinforcement-fine-tuning/pause.png" alt-text="Screenshot of the reinforcement fine-tuning with a running job." lightbox="../media/how-to/reinforcement-fine-tuning/pause.png":::

Once the training job is paused, a deployable checkpoint is created and available for you to run inference against, or you can resume the job to completion. The pause operation is only applicable to jobs that have been trained for at least one step and are in the *Running* state.

## Interpreting training results

@@ -324,12 +324,12 @@ For reinforcement fine-tuning jobs, the primary metrics are the per-step reward

There are two separate top-level reward metrics:

- `train_reward_mean`: The average reward across the samples taken from all datapoints in the current step. Because the specific datapoints in a batch change with each step, `train_reward_mean` values across different steps aren't directly comparable and the specific values can fluctuate drastically from step to step.

- `valid_reward_mean`: The average reward across the samples taken from all datapoints in the validation set, which is a more stable metric.

> [!TIP]
> You should always test inferencing with your model. If you’ve selected an inappropriate grader, it’s possible that the mean reward doesn't reflect the model’s performance. Review sample outputs from the model to ensure they're formatted correctly and make sense. Check if the model's predictions align with the ground truth and if the descriptive analysis provides a reasonable explanation.

### Reasoning tokens

@@ -348,15 +348,15 @@ Understanding the model's behavior can be done quickly by inspecting the evals a

Your fine-tuned model can be deployed via the UI or REST API, just like any other fine-tuned model.

You can deploy the completed fine-tuning job or any intermediate checkpoints created automatically or manually by triggering the pause operation. To learn more about model deployment and testing with the Chat Playground, see [fine-tuning deployment](./fine-tuning-deploy.md).

When using your model, make sure to use the same instructions and structure as used during training. This keeps the model in distribution, and ensures that you see the same performance on your problems during inference as you achieved during training.

## Best practices

### Grader selection

Your graders are used for reinforcement learning: choosing the wrong grader means that your rewards will be invalid, and your fine-tuning won't produce the expected results.

Some basic rules for grader selection:

@@ -372,18 +372,18 @@ Some basic rules for grader selection:

### Test your graders

All of the graders available in RFT are supported in [Azure OpenAI evaluation](./evaluations.md). Before initiating a training run, test a vanilla o4-mini model against your validation data with the same grader you intend to use for training. If the grader scores don't match your expectations, you need to select a different grader.

We also provide a grader check API that you can use to check the validity of your configuration.

### Data preparation

Aim for a few hundred examples initially and consider scaling up to around 1,000 examples if necessary. The dataset should be balanced, in terms of classes predicted, to avoid bias and ensure generalization.

For the prompts, make sure to provide clear and detailed instructions, including specifying the response format and any constraints on the outputs (for example, a minimum length for explanations, or responding only with true/false).

## RFT spending limits

As RFT jobs can lead to high training costs, we're capping per-training-job billing: this cap is the maximum amount that a job can cost before we end the job, even if we're not done processing the entire dataset. The training will be paused and a deployable checkpoint will be created.

Users can validate the training job, metrics, and logs, and then decide whether to resume the job to completion. If the user decides to resume the job, billing continues for the job and no further price limits are placed on the training job.
