
Commit 15c1cdd

update
1 parent f29ad9f commit 15c1cdd

File tree

1 file changed (+68, -50 lines)

articles/ai-services/openai/how-to/reinforcement-fine-tuning.md

Lines changed: 68 additions & 50 deletions
@@ -30,44 +30,38 @@ However, despite these differences, there are many commonalities between SFT and

## Training & evaluation file formation requirements

-Both training and validation files are required to run o4-mini RFT. o4 -mini uses a new format of data for reinforcement fine-tuning. These should be jsonl files, like what is used for supervised fine tuning (SFT).
+Both training and validation files are required to run o4-mini RFT. o4-mini uses a new data format for reinforcement fine-tuning. These should be `jsonl` files, like those used for supervised fine-tuning (SFT).

Each line of the file should contain the messages field, with some differences to SFT:

- System messages aren't supported
- The final message must be from the user, not the assistant (as is the case for SFT)
-- `Tools`, `functions`, `response_formats`, are supported
+- `Tools` and `response_formats` are supported
- Images / multimodal data aren't supported

-Each line must include a new field called reference answer
-
-- `Reference_answer` contains the data used by your grader to determine the correctness of the answer.
-- This value must be a valid JSON object (for example, dictionary, or list; the specific type and structure is dependent on your selected grader).
-
-We currently support a maximum of 50,000 reference examples for training.
+Each line in the JSONL data file should contain a messages array, along with any additional fields required to grade the output from the model. This value must be a valid JSON object (for example, a dictionary or list; the specific type and structure depends on your selected grader).

### Example training data

-```jsonl
-{"messages": [{"role": "user", "content": "Your task is to calculate the results for this math problem based on BODMAS rules. Please analyze the following math problem: 36453-1238 + 25*5"}], "reference_answer": {"Result": 35090}}
-```
-
-Expanding the above text from a single line, you can see the expected fields: messages, role, content, and reference answer:
+If we give the model a puzzle to solve in the required RFT training format, it looks as follows:

```json
{
  "messages": [
    {
      "role": "user",
-      "content": "Your task is to calculate the results for this math problem based on BODMAS rules. Please analyze the following math problem: 36453-1238 + 25*5"
+      "content": "You are a helpful assistant. Your task is to solve the following logic and puzzle quiz:\n\n2. In the expression 1 * 2 * 3 * 4 * 5 * 6 * 7 * 8 * 9 = 100, replace the asterisks with arithmetic operation signs to obtain a correct equation."
    }
  ],
-  "reference_answer": {
-    "Result": 35090
-  }
+  "solution": "Solution. 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 \\cdot 9 = 100.\n\nEvaluation. 12 points for the correct solution.",
+  "final_answer": "1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 \\cdot 9 = 100"
}
```

+We have expanded the above text from a single line of `jsonl` so that you can see the expected fields: `messages`, `role`, `content`, `solution`, and `final_answer`.
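
To produce training rows in this shape programmatically, here's a minimal sketch that uses only the Python standard library. The grading fields (`solution`, `final_answer`) follow the puzzle example above, and the output file name is arbitrary.

```python
import json

# One hypothetical training example in the RFT format shown above:
# a "messages" array plus whatever extra fields your grader will read.
example = {
    "messages": [
        {
            "role": "user",
            "content": (
                "You are a helpful assistant. Your task is to solve the following "
                "logic and puzzle quiz:\n\n2. In the expression "
                "1 * 2 * 3 * 4 * 5 * 6 * 7 * 8 * 9 = 100, replace the asterisks "
                "with arithmetic operation signs to obtain a correct equation."
            ),
        }
    ],
    "solution": "1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 \\cdot 9 = 100.",
    "final_answer": "1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 \\cdot 9 = 100",
}

# Write one JSON object per line (JSONL); both the training and validation
# files use this layout.
with open("rft_train.jsonl", "w", encoding="utf-8") as f:
    f.write(json.dumps(example, ensure_ascii=False) + "\n")
```
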
### Dataset size for RFT

Begin with a small dataset, comprising several dozen to a few hundred examples, to evaluate the efficacy of RFT prior to committing to a larger dataset. For safety reasons, the training set must undergo an automated screening process, which initiates when a fine-tuning job is started, rather than upon file upload. Once a file has successfully passed screening, it can be utilized repeatedly without delay.
@@ -92,7 +86,7 @@ The hyperparameters section of the **reinforcement** method supports all of the
|----|----|----|
|`Eval_samples` |1-10 | The number of samples to use during evaluation. Validation split reward metrics will be averaged across the different samples for each datapoint. Default is 5.|
|`Eval_interval` |1-25 | The number of training steps between evaluations over a provided validation file. Default is 1.|
-|`Compute-multiplier` |0.125 -8.0 | The multiplier on amount of compute use for exploring search space during training. Increasing will result in greater number of samples being rolled per instance. Too low likely to underfit, too high would be prone to overfit. Default is 10.|
+|`Compute-multiplier` |0.5-3.0 | The multiplier on the amount of compute used for exploring the search space during training. Increasing it results in a greater number of samples being rolled out per instance. Too low is likely to underfit; too high is prone to overfit.|
|`Reasoning_effort`|Low, Medium, High | The amount of effort the model should put into reasoning. Defaults to medium effort. If performance is poor, consider increasing the reasoning effort. |
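
As an illustration of how these hyperparameters might be supplied when creating an RFT job, here's a rough sketch using the OpenAI Python client against Azure OpenAI. The payload shape under `method`, the API version, and the model snapshot name are assumptions based on the table and examples in this article rather than a verified reference, so check the current fine-tuning API documentation before relying on it.

```python
from openai import AzureOpenAI  # assumes the openai Python package with Azure support

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",  # placeholder
    api_key="<your-api-key>",                                    # placeholder
    api_version="2025-04-01-preview",                            # assumed API version
)

# Hypothetical job creation: the "method" block mirrors the hyperparameter
# names in the table above; verify the exact schema in the API reference.
job = client.fine_tuning.jobs.create(
    model="o4-mini-2025-04-16",            # assumed o4-mini snapshot name
    training_file="<training-file-id>",
    validation_file="<validation-file-id>",
    method={
        "type": "reinforcement",
        "reinforcement": {
            "grader": {
                "name": "string_check_sample_grader",
                "type": "string_check",
                "operation": "eq",
                "input": "{{sample.output_text}}",
                "reference": "{{item.final_answer}}",
            },
            "hyperparameters": {
                "eval_samples": 5,
                "eval_interval": 1,
                "compute_multiplier": 1.0,
                "reasoning_effort": "medium",
            },
        },
    },
)
print(job.id)
```
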

> [!NOTE]
@@ -111,7 +105,6 @@ Effectively, graders are functions that compare the reference_answer from your t

- Return floating point numbers between 0 and 1. It can be helpful to give the model partial credit for answers, rather than binary 0/1 (see the conceptual sketch after this list).
- Graders are specified as JSON (see below)
-- We only support simple comparisons at this time (for example, between strings or JSON objects).

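To make the partial-credit idea concrete, here's a conceptual sketch of the kind of comparison a grader performs. It's plain Python for illustration only (the service executes graders from their JSON definition, not from your code), and the answer strings are borrowed from the puzzle example earlier.

```python
def toy_grader(reference_answer: str, model_answer: str) -> float:
    """Conceptual grader: exact match earns 1.0, a partial match earns 0.5,
    anything else earns 0.0."""
    reference = reference_answer.strip().lower()
    answer = model_answer.strip().lower()
    if answer == reference:
        return 1.0
    # Partial credit when the reference appears inside a longer answer.
    if reference and reference in answer:
        return 0.5
    return 0.0


# Example: the model wrapped the correct equation in extra prose.
print(toy_grader(
    "1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 \\cdot 9 = 100",
    "The answer is 1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 \\cdot 9 = 100",
))  # prints 0.5
```
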
### Supported graders

@@ -216,7 +209,7 @@ A multigrader object combines the output of multiple graders to produce a single
- `/` (division)
- `^` (power)

*Functions:*
- `min`
- `max`
- `abs`
@@ -226,7 +219,7 @@ A multigrader object combines the output of multiple graders to produce a single
- `sqrt`
- `log`

-When using the UX you're able to write a prompt and generate a valid grader and response format in json as needed.
+When using the UX, you're able to write a prompt and generate a valid grader and response format in JSON as needed. The grader is a mandatory field that must be entered when submitting a fine-tuning job. The response format is optional.

> [!IMPORTANT]
> Generating a correct grader schema requires careful prompt authoring. You may find that your first few attempts generate invalid schemas or don't create a schema that will properly handle your training data. The grader is a mandatory field that must be entered when submitting a fine-tuning job. The response format is optional.
@@ -241,7 +234,7 @@ Here's an example grader for each category:

```json
{
-  "name": "simpleadd_ans_grader",
+  "name": "string_check_sample_grader",
  "type": "string_check",
  "input": "{{item.reference_answer}}",
  "reference": "{{sample.output_text}}",
@@ -265,39 +258,71 @@ Here's an example grader for each category:
Models that we support as grader models are `gpt-4o-2024-08-06` and `o3-mini-2025-01-31`.

```json
-0.5 if one is the same, and 0.0 if neither are the same. Return just a floating point score\n\n Reference answer: {\"donors\": {{item.reference_answer.donors}}, \"acceptors\": {{item.reference_answer.acceptors}}}\n\n Model answer: {{sample.output_text}}"
-}
-],
-"model": "gpt-4o",
-"sampling_params": {
-"seed": 1,
-"temperature": 1,
-"max_completions_tokens": 1000,
-"top_p": 1
-},
-"name": "grader-test"
+{
+  "name": "score_model_sample_grader",
+  "type": "score_model",
+  "input": [
+    {
+      "role": "user",
+      "content": "Score\nhow close the reference answer is to the model answer. You will be comparing these\ntwo as JSON objects that contain 2 keys, \"extracted_text\" and\n\"clause_type\". Score 1.0 if they are both the same, 0.5 if one is\nthe same, and 0.0 if neither are the same. Return just a floating point\nscore\n\n Reference answer: {\"extracted_text\": \n{{item.extracted_text}}, \"clause_type\": {{item.clause_type}}}\n\n\nModel answer: {{sample.output_json}}"
+    }
+  ],
+  "model": "gpt-4o-2024-08-06",
+  "sampling_params": {"seed": 42}
+}
```
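
The `{{item.*}}` and `{{sample.*}}` placeholders in the grader prompt above are filled in from the fields of your training data row and from the model's sampled output. The service performs this substitution for you; as a rough mental model only, it amounts to something like the following sketch, where the item field values are invented to match the template above.

```python
# One row of training data (only the fields the grader template references).
item = {
    "extracted_text": "The agreement may be terminated with 30 days notice.",
    "clause_type": "termination",
}

# A structured output sampled from the model being fine-tuned.
sample_output_json = (
    '{"extracted_text": "terminated with 30 days notice", '
    '"clause_type": "termination"}'
)

# Simplified grader prompt using the same placeholder syntax as above.
template = (
    "Score how close the reference answer is to the model answer.\n\n"
    'Reference answer: {"extracted_text": {{item.extracted_text}}, '
    '"clause_type": {{item.clause_type}}}\n\n'
    "Model answer: {{sample.output_json}}"
)

# Substitute the placeholders, as the service does before sending the prompt
# to the grader model (gpt-4o-2024-08-06 or o3-mini-2025-01-31).
rendered = (
    template.replace("{{item.extracted_text}}", item["extracted_text"])
    .replace("{{item.clause_type}}", item["clause_type"])
    .replace("{{sample.output_json}}", sample_output_json)
)
print(rendered)
```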

**Multi Grader** - A multigrader object combines the output of multiple graders to produce a single score.

```json
{
-"name":"clause_match_grader",
+"name":"sample_multi_grader",
"type":"multi",
"graders":{"ext_text_similarity":{"name":"ext_text_similarity",
"type":"text_similarity",
"input":"{{sample.output_json.ext_text}}",
-"reference":"{{item.reference_answer.ext_text}}",
+"reference":"{{item.ext_text}}",
"evaluation_metric":"fuzzy_match"},

"clause_string_check":{"name":"clause_string_check",
"type":"string_check",
"input":"{{sample.output_json.clause_type}}",
"operation":"eq",
-"reference":"{{item.reference_answer.clause_type}}"}},
+"reference":"{{item.clause_type}}"}},

"calculate_output":"0.5 * ext_text_similarity + 0.5 * clause_string_check"
+}
+```
+
+> [!NOTE]
+> Currently we don't support `multi` with a model grader as a sub-grader. The `multi` grader is supported only with `text_similarity` and `string_check`.
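
The `calculate_output` expression above is evaluated by the service to combine the two sub-grader scores into a single reward. Purely to show what the weighting does, here's a one-function sketch of that arithmetic:

```python
def combine_scores(ext_text_similarity: float, clause_string_check: float) -> float:
    """Mirrors the calculate_output expression:
    0.5 * ext_text_similarity + 0.5 * clause_string_check."""
    return 0.5 * ext_text_similarity + 0.5 * clause_string_check


# Fuzzy text match scored 0.8 and the clause type matched exactly (1.0),
# so the combined reward is 0.9.
print(combine_scores(0.8, 1.0))
```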

+Example of a response format, which is an optional field:
+
+If we need the response for the same puzzle problem used in the training data example, we can add the response format as shown below, where the fields `solution` and `final_answer` are returned as structured outputs.
+
+```json
+{
+    "type": "json_schema",
+    "name": "puzzles_assistant",
+    "schema": {
+        "type": "object",
+        "properties": {
+            "solution": {
+                "type": "string",
+                "title": "solution"
+            },
+            "final_answer": {
+                "type": "string",
+                "title": "final_answer"
+            }
+        },
+        "required": [
+            "solution",
+            "final_answer"
+        ],
+        "additionalProperties": false
+    },
+    "strict": true
}
```
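
Because the response format is a strict JSON schema, each sampled output should parse as an object containing exactly the `solution` and `final_answer` strings. Here's a small sketch for sanity-checking an output locally; it uses the third-party `jsonschema` package, and the schema is copied from the example above.

```python
import json

from jsonschema import validate  # pip install jsonschema

schema = {
    "type": "object",
    "properties": {
        "solution": {"type": "string", "title": "solution"},
        "final_answer": {"type": "string", "title": "final_answer"},
    },
    "required": ["solution", "final_answer"],
    "additionalProperties": False,
}

# A model response we might get back when this response format is attached.
model_output = (
    '{"solution": "1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 \\\\cdot 9 = 100.", '
    '"final_answer": "1 + 2 + 3 + 4 + 5 + 6 + 7 + 8 \\\\cdot 9 = 100"}'
)

# Raises jsonschema.ValidationError if the output doesn't match the schema.
validate(instance=json.loads(model_output), schema=schema)
print("Output conforms to the response format.")
```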

@@ -309,12 +334,13 @@ You can also review the results files while training runs, to get a peak at the

**New feature: pause and resume**

-During the training you can view the logs and RFT metrics and pause the job as needed (if metrics aren't converging or if you feel model isn't learning at the right pace, incorrect grader chosen, etc.).
+During training you can view the logs and RFT metrics, and pause the job as needed (if metrics aren't converging, if you feel the model isn't learning at the right pace, if an incorrect grader was chosen, etc.). Once the training job is paused, a deployable checkpoint is created and made available for you to run inference against, or to resume the job to completion. The pause operation is only applicable to jobs that have been trained for at least one step and are in the *Running* state.

:::image type="content" source="../media/how-to/reinforcement-fine-tuning/pause.png" alt-text="Screenshot of the reinforcement fine-tuning with a running job." lightbox="../media/how-to/reinforcement-fine-tuning/pause.png":::

+### Guardrails on training spending

-Once the training job is paused, a deployable checkpoint will be created and available for you to infer or resume the job further to completion. Pause operation is only applicable for jobs which have been trained for at least one step and are in *Running* state.
+As an RFT job can lead to high training costs, we automatically pause jobs once they have hit $5K in total training costs (training + grading). You can deploy the most recent checkpoint or resume the training job. If you decide to resume the job, billing continues and no further price limits are placed on the training job.

## Interpreting training results

@@ -340,7 +366,7 @@ The `train_reasoning_tokens_mean` and `valid_reasoning_tokens_mean` metrics to s

## Evaluate the results

-By the time your fine-tuning job finishes, you should have a decent idea of how well the model is performing based on the mean reward value on the validation set. However, it's possible that the model has either overfit to the training data or has learned to reward hack your grader, which allows it to produce high scores without actually being correct. Before deploying your model, inspect its behavior on a representative set of prompts to ensure it behaves how you expect.
+By the time your fine-tuning job finishes, you should have a decent idea of how well the model is performing based on the mean reward value on the validation set. However, it's possible that the model has either overfit to the training data or has learned to reward hack your grader, which allows it to produce high scores without actually being correct.

Understanding the model's behavior can be done quickly by inspecting the evals associated with the fine-tuning job. Specifically, pay close attention to the run made for the final training step to see the end model's behavior. You can also use the evals product to compare the final run to earlier runs and see how the model's behavior has changed over the course of training.

@@ -364,8 +390,6 @@ Some basic rules for grader selection:

- If you have **complex responses that can be scored on multiple criteria, use multi graders**. This allows you to score different aspects of the response and combine them into an aggregate.

-- **Guard against reward hacking**. This happens when the model finds a shortcut that earns high scores without real skill. Make it hard to loophole your grading system.
-
- **Consider breaking the grader down into multiple steps**, and giving partial credit, to nudge the model's reasoning in the right direction so that grading is stable and aligned with your preferences. Provide few-shot examples of great, fair, and poor answers in the prompt.

- **Use an LLM-as-a-judge when code falls short**. For rich, open-ended answers, ask another language model to grade. When building LLM graders, run multiple candidate responses and ground truths through your LLM judge to ensure
@@ -380,10 +404,4 @@ We also provide a grader check API that you can use to check the validity of you

Aim for a few hundred examples initially and consider scaling up to around 1,000 examples if necessary. The dataset should be balanced, in terms of classes predicted, to avoid bias and ensure generalization.

-For the prompts, make sure to provide clear and detailed instructions, including specifying the response format and any constraints on the outputs (e.g. minimum length for explanations, only respond with true/false etc.)
-
-## RFT spending limits
-
-As RFT job can lead to high training costs, we're capping the pricing for per training job billing which means this will be the maximum amount that a job can cost before we end the job, even if we're not done processing the entire dataset. The training will be paused and a deployable checkpoint will be created.
-
-Users can validate the training job, metrics, logs and then decide to resume the job to complete further. If the user decides to resume the job, billing will continue for the job and subsequently no further price limits would be placed on the training job.
+For the prompts, make sure to provide clear and detailed instructions, including specifying the response format and any constraints on the outputs (e.g., minimum length for explanations, only respond with true/false, etc.).
