articles/ai-services/openai/how-to/reinforcement-fine-tuning.md
## Training & evaluation file format requirements
Both training and validation files are required to run o4-mini RFT. o4-mini uses a new data format for reinforcement fine-tuning. These should be JSONL files, like those used for supervised fine-tuning (SFT).

Each line of the file should contain the `messages` field, with some differences from SFT:
- System messages aren't supported
- The final message must be from the user, not the assistant (as is the case for SFT)
- `Tools` and `response_formats` are supported
- Images / multimodal data aren't supported

Each line in the JSONL data file should contain a `messages` array, along with any additional fields required to grade the output from the model. The value of each additional field must be a valid JSON object (for example, a dictionary or a list; the specific type and structure depends on your selected grader).
### Example training data
If we give the model a puzzle to solve, a training example in the required RFT format would look as follows (here one valid solution to the puzzle is used as the `final_answer` reference):

```json
{
  "messages": [
    {
      "role": "user",
      "content": "You are a helpful assistant. Your task is to solve the following logic and puzzle quiz:\n\n2. In the expression 1 * 2 * 3 * 4 * 5 * 6 * 7 * 8 * 9 = 100, replace the asterisks with arithmetic operation signs to obtain a correct equation."
    }
  ],
  "final_answer": "1+2+3+4+5+6+7+8*9=100"
}
```

We have expanded the above text from a single line of `jsonl`, so you can see the expected fields: `messages`, `role`, `content`, and `final_answer`.
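
On disk, each training example is a single line of JSON. Assuming the reference field is named `final_answer` as above, the expanded object corresponds to a JSONL line like this:

```jsonl
{"messages": [{"role": "user", "content": "You are a helpful assistant. Your task is to solve the following logic and puzzle quiz:\n\n2. In the expression 1 * 2 * 3 * 4 * 5 * 6 * 7 * 8 * 9 = 100, replace the asterisks with arithmetic operation signs to obtain a correct equation."}], "final_answer": "1+2+3+4+5+6+7+8*9=100"}
```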
### Dataset size for RFT
Begin with a small dataset, comprising several dozen to a few hundred examples, to evaluate the efficacy of RFT prior to committing to a larger dataset. For safety reasons, the training set must undergo an automated screening process, which initiates when a fine-tuning job is started, rather than upon file upload. Once a file has successfully passed screening, it can be utilized repeatedly without delay.

The hyperparameters section of the **reinforcement** method supports all of the regular fine-tuning hyperparameters, along with the following RFT-specific hyperparameters:

|Hyperparameter|Value range|Description|
|----|----|----|
|`Eval_samples` |1-10 | The number of samples to use during evaluation. Validation split reward metrics will be averaged across the different samples for each datapoint. Default is 5.|
|`Eval_interval`|1-25 | The number of training steps between evaluations over a provided validation file. Default is 1.|
|`Compute-multiplier`|0.5 - 3.0 | The multiplier on the amount of compute used to explore the search space during training. Increasing it results in a greater number of samples being rolled out per instance. Too low is likely to underfit, while too high is prone to overfit.|
|`Reasoning_effort`|Low, Medium, High | The amount of effort the model should put into reasoning. Defaults to medium effort. If performance is poor, consider increasing the reasoning effort. |
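
To show where these hyperparameters live, here's a minimal, illustrative sketch of the `method` section of a reinforcement fine-tuning job request. The grader shown is a placeholder, the values are examples rather than recommendations, and the exact field names and nesting should be checked against the fine-tuning REST reference for your API version:

```json
{
  "method": {
    "type": "reinforcement",
    "reinforcement": {
      "grader": {
        "name": "string_check_sample_grader",
        "type": "string_check",
        "input": "{{sample.output_text}}",
        "reference": "{{item.final_answer}}",
        "operation": "eq"
      },
      "hyperparameters": {
        "eval_samples": 5,
        "eval_interval": 1,
        "compute_multiplier": 1.0,
        "reasoning_effort": "medium"
      }
    }
  }
}
```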

Effectively, graders are functions that compare the `reference_answer` from your training data file with the model's output:
- Return floating point numbers between 0 and 1. It can be helpful to give the model partial credit for answers, rather than binary 0/1.
- Graders are specified as JSON (see below)
### Supported graders

A multigrader object combines the output of multiple graders to produce a single score. The formula used to combine the sub-grader scores supports the following operators and functions:

*Operators:*
- `+` (addition)
- `-` (subtraction)
- `*` (multiplication)
- `/` (division)
- `^` (power)

*Functions:*
- `min`
- `max`
- `abs`
- `sqrt`
- `log`

When using the UX, you can write a prompt and generate a valid grader and response format in JSON as needed. The grader is a mandatory field when submitting a fine-tuning job; the response format is optional.
> [!IMPORTANT]
> Generating a correct grader schema requires careful prompt authoring. You may find that your first few attempts generate invalid schemas or schemas that don't properly handle your training data. The grader is a mandatory field that must be entered while submitting a fine-tuning job. The response format is optional.

Here's an example grader for each category:

```json
{
  "name": "string_check_sample_grader",
  "type": "string_check",
  "input": "{{item.reference_answer}}",
  "reference": "{{sample.output_text}}",
  "operation": "eq"
}
```

The models we support as grader models are `gpt-4o-2024-08-06` and `o3-mini-2025-01-31`.

```json
{
  "name": "score_model_sample_grader",
  "type": "score_model",
  "input": [
    {
      "role": "user",
      "content": "Score\nhow close the reference answer is to the model answer. You will be comparing these\ntwo as JSON objects that contain 2 keys, \"extracted_text\" and\n\"clause_type\". Score 1.0 if they are both the same, 0.5 if one is\nthe same, and 0.0 if neither are the same. Return just a floating point\nscore\n\n Reference answer: {\"extracted_text\": \n{{item.extracted_text}}, \"clause_type\": {{item.clause_type}}}\n\n\nModel answer: {{sample.output_json}}"
    }
  ],
  "model": "gpt-4o-2024-08-06",
  "sampling_params": {"seed": 42}
}
```
**Multi Grader** - A multigrader object combines the output of multiple graders to produce a single score.
> Currently we don't support `multi` with a model grader as a subgrader. The `multi` grader is supported only with `text_similarity` and `string_check`.
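
For illustration, here's a minimal sketch of what a `multi` grader combining a `string_check` and a `text_similarity` sub-grader could look like. The grader names, the weights in `calculate_output`, and the `fuzzy_match` metric are example choices rather than values from this article, and the exact schema should be validated before submitting a job:

```json
{
  "name": "puzzle_multi_grader",
  "type": "multi",
  "graders": {
    "exact_match": {
      "name": "exact_match",
      "type": "string_check",
      "input": "{{sample.output_text}}",
      "reference": "{{item.final_answer}}",
      "operation": "eq"
    },
    "close_match": {
      "name": "close_match",
      "type": "text_similarity",
      "input": "{{sample.output_text}}",
      "reference": "{{item.final_answer}}",
      "evaluation_metric": "fuzzy_match"
    }
  },
  "calculate_output": "0.5 * exact_match + 0.5 * close_match"
}
```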
Here's an example of a response format, which is an optional field:

If we want a structured response for the same puzzle problem used in the training data example, we can add a response format as shown below, where the fields `solution` and `final_answer` are returned as structured outputs.

```json
{
  "type": "json_schema",
  "name": "puzzles_assistant",
  "schema": {
    "type": "object",
    "properties": {
      "solution": {
        "type": "string",
        "title": "solution"
      },
      "final_answer": {
        "type": "string",
        "title": "final_answer"
      }
    },
    "required": [
      "solution",
      "final_answer"
    ],
    "additionalProperties": false
  },
  "strict": true
}
```
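
When a response format like this is used, a grader can reference individual fields of the model's structured output through `sample.output_json` (as the score model example above does) rather than the raw text. Here's a minimal sketch, assuming the training data carries a matching `final_answer` field; the grader name is hypothetical:

```json
{
  "name": "puzzle_final_answer_grader",
  "type": "string_check",
  "input": "{{sample.output_json.final_answer}}",
  "reference": "{{item.final_answer}}",
  "operation": "eq"
}
```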

You can also review the results files while training runs, to get a peek at the progress.

**New feature: pause and resume**

During training you can view the logs and RFT metrics and pause the job as needed (for example, if metrics aren't converging, if you feel the model isn't learning at the right pace, or if an incorrect grader was chosen). Once the training job is paused, a deployable checkpoint is created and available for you to run inference against, or you can resume the job and train it to completion. The pause operation is only applicable to jobs that have been trained for at least one step and are in the *Running* state.
:::image type="content" source="../media/how-to/reinforcement-fine-tuning/pause.png" alt-text="Screenshot of the reinforcement fine-tuning with a running job." lightbox="../media/how-to/reinforcement-fine-tuning/pause.png":::
### Guardrails on training spending

As an RFT job can lead to high training costs, we automatically pause jobs once they hit $5K in total training costs (training + grading). Users can deploy the most recent checkpoint or resume the training job. If the user decides to resume the job, billing continues for the job and subsequently no further price limits are placed on the training job.
## Interpreting training results
## Evaluate the results
By the time your fine-tuning job finishes, you should have a decent idea of how well the model is performing based on the mean reward value on the validation set. However, it's possible that the model has either overfit to the training data or has learned to reward hack your grader, which allows it to produce high scores without actually being correct.

Understanding the model's behavior can be done quickly by inspecting the evals associated with the fine-tuning job. Specifically, pay close attention to the run made for the final training step to see the end model's behavior. You can also use the evals product to compare the final run to earlier runs and see how the model's behavior has changed over the course of training.

Some basic rules for grader selection:
- If you have **complex responses that can be scored on multiple criteria, use multi graders**. This allows you to score different aspects of the response and combine them into an aggregate score.
- **Consider breaking the grader down into multiple steps**, and giving partial credit, to nudge the model's reasoning in the right direction.
- **Use an LLM as a judge when code falls short**. For rich, open-ended answers, ask another language model to grade. When building LLM graders, run multiple candidate responses and ground truths through your LLM judge to ensure grading is stable and aligned with preference. Provide few-shot examples of great, fair, and poor answers in the prompt.

We also provide a grader check API that you can use to check the validity of your grader.

Aim for a few hundred examples initially and consider scaling up to around 1,000 examples if necessary. The dataset should be balanced, in terms of classes predicted, to avoid bias and ensure generalization.

For the prompts, make sure to provide clear and detailed instructions, including specifying the response format and any constraints on the outputs (for example, a minimum length for explanations, or responding only with true/false).