articles/ai-services/openai/how-to/reinforcement-fine-tuning.md (34 additions, 34 deletions)
# Reinforcement fine-tuning (RFT) with Azure OpenAI o4-mini
Reinforcement fine-tuning (RFT) is a technique for improving reasoning models like o4-mini by training them through a reward-based process, rather than relying only on labeled data. By using feedback or "rewards" to guide learning, RFT helps models develop better reasoning and problem-solving skills, especially in cases where labeled examples are limited or complex behaviors are desired.
## Process
The process of reinforcement fine-tuning (RFT) is similar to supervised fine-tuning (SFT), with some notable differences:

- **Data preparation:** system messages aren't supported, and instead of an `assistant` message, the final message in your training data is a reference answer.
- **Model selection:** only o4-mini supports RFT.
- **Grader definition:** RFT requires the use of **graders** to score the quality of your fine-tuned model and guide learning. You can use string check, text similarity, or model-based graders – or combine them with a multi-grader.
- **Training:** includes additional parameters: `eval_samples`, `eval_interval`, `reasoning_effort`, and `compute_multiplier`. You also have the option to pause and resume jobs, allowing you to pause training, inspect checkpoints, and only continue if further training is needed.
However, despite these differences, there are many commonalities between SFT and RFT.
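For orientation, here's a sketch of where the RFT-specific pieces (the grader and the additional training parameters named above) might sit in a fine-tuning job request body. It follows the general shape of the public fine-tuning API, so treat the exact field names, the file IDs, the `label` sub-field, and the parameter values as illustrative assumptions rather than a definitive request:

```json
{
    "model": "o4-mini",
    "training_file": "file-<training-file-id>",
    "validation_file": "file-<validation-file-id>",
    "method": {
        "type": "reinforcement",
        "reinforcement": {
            "grader": {
                "type": "string_check",
                "name": "example_grader",
                "operation": "eq",
                "input": "{{sample.output_text}}",
                "reference": "{{item.reference_answer.label}}"
            },
            "hyperparameters": {
                "eval_samples": 4,
                "eval_interval": 5,
                "reasoning_effort": "medium",
                "compute_multiplier": 1.0
            }
        }
    }
}
```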
## Training & evaluation file format requirements
Both training and validation files are required to run o4-mini RFT. o4-mini uses a new data format for reinforcement fine-tuning. These should be JSONL files, like those used for supervised fine-tuning (SFT).

Each line of the file should contain the `messages` field, with some differences from SFT:
- System messages aren't supported.
- The final message must be from the user, not the assistant (as is the case for SFT).
- `tools`, `functions`, and `response_formats` are supported.
- Images / multimodal data aren't supported.

Each line must include a new field called `reference_answer`:

- `reference_answer` contains the data used by your grader to determine the correctness of the answer.
- This value must be a valid JSON object (for example, a dictionary or list; the specific type and structure depends on your selected grader).
We currently support a maximum of 50,000 reference examples for training.
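As a sketch of what a single training line could look like, each line of the JSONL file is one JSON object along these lines. The task, the prompt text, and the `label` field inside `reference_answer` are hypothetical; your chosen grader determines what `reference_answer` must actually contain:

```json
{
    "messages": [
        {
            "role": "user",
            "content": "Classify the sentiment of the following review as positive, negative, or neutral, and reply with just the label: The battery lasts two days on a single charge."
        }
    ],
    "reference_answer": {
        "label": "positive"
    }
}
```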
## Graders
RFT is unique because it uses graders to assess the quality of a model’s response to teach the model to reason. Unlike SFT, the final message isn't from the assistant – instead we sample the model and use a grader on each sample to score its quality. We then train based on those scores to improve model performance.
Effectively, graders are functions that compare the `reference_answer` from your training file with the sampled response.
### Supported graders
We support three types of graders: string check, text similarity, and model graders. There's also an option to combine graders by using a multi-grader.
### String-check-grader
Use these basic string operations to return a `0` or `1`.
**Specification:**
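As a sketch, a string check grader takes roughly this shape. The field names follow the publicly documented grader format and the placeholder values are illustrative, so treat both as assumptions:

```json
{
    "type": "string_check",
    "name": "<name for this grader>",
    "operation": "eq | neq | like | ilike",
    "input": "<text or template, for example {{sample.output_text}}>",
    "reference": "<text or template, for example {{item.reference_answer}}>"
}
```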
**Supported operations:**
- `eq`: Returns 1 if the input matches the reference (case-sensitive), 0 otherwise.
- `neq`: Returns 1 if the input doesn't match the reference (case-sensitive), 0 otherwise.
- `like`: Returns 1 if the input contains the reference (case-sensitive), 0 otherwise.
- `ilike`: Returns 1 if the input contains the reference (not case-sensitive), 0 otherwise.
### Text similarity
Use this grader to evaluate how close the model-generated output is to the reference, scored with various evaluation metrics.
**Specification:**
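As a sketch, a text similarity grader takes roughly this shape. The field names and the example metric names follow the publicly documented grader format; treat them as assumptions:

```json
{
    "type": "text_similarity",
    "name": "<name for this grader>",
    "input": "<text or template, for example {{sample.output_text}}>",
    "reference": "<text or template, for example {{item.reference_answer}}>",
    "evaluation_metric": "<metric, for example fuzzy_match, bleu, or rouge_l>"
}
```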
### Score model
This is a model grader where you can use an LLM to grade the training output.
The models we support as grader models are:
- `gpt-4o-2024-08-06`
- `o3-mini-2025-01-31`
To use a score model grader, the input is a list of chat messages, each containing a role and content. The output of the grader is truncated to the given range, and defaults to 0 for all non-numeric outputs.
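As a sketch, a score model grader takes roughly this shape. The field names follow the publicly documented grader format; the optional `range` and the message content are assumptions:

```json
{
    "type": "score_model",
    "name": "<name for this grader>",
    "model": "gpt-4o-2024-08-06",
    "input": [
        {
            "role": "user",
            "content": "<grading instructions, typically referencing {{item.reference_answer}} and {{sample.output_text}}>"
        }
    ],
    "range": [0, 1]
}
```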
### Multi Grader
A multigrader object combines the output of multiple graders to produce a single score. The calculation formula supports operations such as:

- `sqrt`
- `log`
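As a sketch of how graders can be combined, a multi-grader nests other graders under `graders` and combines their scores with a `calculate_output` formula. The sub-grader names, weights, and the `label` and `explanation` sub-fields of the reference answer are illustrative assumptions:

```json
{
    "type": "multi",
    "graders": {
        "exact_label": {
            "type": "string_check",
            "name": "exact_label",
            "operation": "eq",
            "input": "{{sample.output_text}}",
            "reference": "{{item.reference_answer.label}}"
        },
        "similarity": {
            "type": "text_similarity",
            "name": "similarity",
            "input": "{{sample.output_text}}",
            "reference": "{{item.reference_answer.explanation}}",
            "evaluation_metric": "fuzzy_match"
        }
    },
    "calculate_output": "0.7 * exact_label + 0.3 * similarity"
}
```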
When using the UX, you can write a prompt and generate a valid grader and response format in JSON as needed.
> [!IMPORTANT]
> Generating a correct grader schema requires careful prompt authoring. You may find that your first few attempts generate invalid schemas or don't create a schema that will properly handle your training data. Grader is a mandatory field that must be entered when submitting a fine-tuning job. Response format is optional.
:::image type="content" source="../media/how-to/reinforcement-fine-tuning/grader-schema.png" alt-text="Screenshot of the reinforcement fine-tuning grader schema generation experience." lightbox="../media/how-to/reinforcement-fine-tuning/grader-schema.png":::
Here's an example grader for each category:
**string-check-grader** - use simple string operations to return a 0 or 1.
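A concrete version might look like the following sketch; the grader name and the `label` sub-field of the reference answer are illustrative assumptions:

```json
{
    "type": "string_check",
    "name": "sentiment_label_check",
    "operation": "eq",
    "input": "{{sample.output_text}}",
    "reference": "{{item.reference_answer.label}}"
}
```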
**Text similarity** - Evaluate how close the model-generated output is to the reference, scored with various evaluation metrics.
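A concrete version might look like this sketch; the grader name, the `explanation` sub-field, and the choice of `fuzzy_match` are illustrative assumptions:

```json
{
    "type": "text_similarity",
    "name": "explanation_similarity",
    "input": "{{sample.output_text}}",
    "reference": "{{item.reference_answer.explanation}}",
    "evaluation_metric": "fuzzy_match"
}
```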
**Score model** - This is a model grader where you can use an LLM to grade the training output.
The models we support as grader models are `gpt-4o-2024-08-06` and `o3-mini-2025-01-31`.
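A concrete version might look like the following sketch, which grades hypothetical `donors` and `acceptors` fields from the reference answer; the grader name, the exact prompt wording, and the `range` field are illustrative assumptions:

```json
{
    "type": "score_model",
    "name": "donor_acceptor_grader",
    "model": "gpt-4o-2024-08-06",
    "input": [
        {
            "role": "user",
            "content": "Score 1.0 if both the donors and acceptors in the model answer match the reference answer, 0.5 if one is the same, and 0.0 if neither are the same. Return just a floating point score.\n\nReference answer: {\"donors\": {{item.reference_answer.donors}}, \"acceptors\": {{item.reference_answer.acceptors}}}\n\nModel answer: {{sample.output_text}}"
        }
    ],
    "range": [0, 1]
}
```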
## Training progress and results
RFT jobs are typically long running, and may take up to 24 hours depending on your parameter selection. You can track progress in both fine-tuning views of the AI Foundry portal. You'll see your job go through the same statuses as normal fine-tuning jobs (queued, running, succeeded).
You can also review the results files while training runs, to get a peek at the progress and whether your training is proceeding as expected.
**New feature: pause and resume**
During training you can view the logs and RFT metrics and pause the job as needed (for example, if metrics aren't converging, if you feel the model isn't learning at the right pace, or if you chose an incorrect grader).
:::image type="content" source="../media/how-to/reinforcement-fine-tuning/pause.png" alt-text="Screenshot of the reinforcement fine-tuning with a running job." lightbox="../media/how-to/reinforcement-fine-tuning/pause.png":::
Once the training job is paused, a deployable checkpoint is created and available for you to use for inference, or you can resume the job to completion. The pause operation is only applicable to jobs that have been trained for at least one step and are in the *Running* state.
## Interpreting training results
For reinforcement fine-tuning jobs, the primary metrics are the per-step reward metrics.
There are two separate top-level reward metrics:
- `train_reward_mean`: The average reward across the samples taken from all datapoints in the current step. Because the specific datapoints in a batch change with each step, `train_reward_mean` values across different steps aren't directly comparable and the specific values can fluctuate drastically from step to step.
- `valid_reward_mean`: The average reward across the samples taken from all datapoints in the validation set, which is a more stable metric.
> [!TIP]
> You should always test inferencing with your model. If you’ve selected an inappropriate grader, it’s possible that the mean reward doesn't reflect the model’s performance. Review sample outputs from the model to ensure they're formatted correctly and make sense. Check if the model's predictions align with the ground truth and if the descriptive analysis provides a reasonable explanation.
### Reasoning tokens
## Deployment
Your fine-tuned model can be deployed via the UI or REST API, just like any other fine-tuned model.
You can deploy the completed fine-tuning job, or any intermediate checkpoints created automatically or manually by triggering the pause operation. To learn more about model deployment and testing with the Chat Playground, see [fine-tuning deployment](./fine-tuning-deploy.md).
When using your model, make sure to use the same instructions and structure as used during training. This keeps the model in distribution, and ensures that you see the same performance on your problems during inference as you achieved during training.
## Best practices
### Grader selection
Your graders are used for reinforcement learning: choosing the wrong grader means that your rewards will be invalid, and your fine-tuning won't produce the expected results.
Some basic rules for grader selection:
### Test your graders
All of the graders available in RFT are supported in [Azure OpenAI evaluation](./evaluations.md). Before initiating a training run, test a vanilla o4-mini model against your validation data with the same grader you intend to use for training. If the grader scores don't match your expectations, you need to select a different grader.
We also provide a grader check API that you can use to check the validity of your configuration.
### Data preparation
Aim for a few hundred examples initially and consider scaling up to around 1,000 examples if necessary. The dataset should be balanced, in terms of classes predicted, to avoid bias and ensure generalization.
For the prompts, make sure to provide clear and detailed instructions, including specifying the response format and any constraints on the outputs (for example, a minimum length for explanations, or responding only with true/false).
## RFT spending limits
Because an RFT job can lead to high training costs, we're capping the pricing for per-training-job billing, which means this will be the maximum amount that a job can cost before we end the job, even if we're not done processing the entire dataset. The training will be paused and a deployable checkpoint will be created.
Users can validate the training job, metrics, and logs, and then decide whether to resume the job to completion. If the user decides to resume the job, billing continues for the job, and subsequently no further price limits are placed on the training job.