Commit 5fe84bb

fix grpo doc (#3920)
* fix doc
* fix json
1 parent a52cd65 commit 5fe84bb

File tree

4 files changed: +30, -10 lines changed


docs/source/BestPractices/GRPO多模态训练.md

Lines changed: 8 additions & 3 deletions
@@ -37,9 +37,14 @@ register_dataset(
 
 ```json
 {
-    'images': [{'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x01\xe0\x00\x00\x01@\x08\x06\x00\x00\x00d\xc8\xafB`\x82 ...', 'path': 'CLEVR_trainA_000000.png'}],
-    'messages': [{'role': 'user', 'content': 'How many items are there in the image? Output the thinking process in <think> </think> and\n final answer (number) in <answer> </answer> tags.'}],
-    'solution': '<answer> 3 </answer>'
+    "images": ["image_path1", "image_path2"],
+    "messages": [
+        {
+            "role": "user",
+            "content": "How many items are there in the image? Output the thinking process in <think> </think> and \n final answer (number) in <answer> </answer> tags."
+        }
+    ],
+    "solution": "<answer> 3 </answer>"
 }
 ```

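For reference (not part of this commit), a minimal sketch of how a dataset file in the images / messages / solution schema shown above could be produced; the file name, image path, question text, and answer below are placeholder assumptions, only the field names come from the documented format:

```python
import json

# Hypothetical example record following the images / messages / solution schema.
records = [
    {
        "images": ["images/CLEVR_trainA_000000.png"],
        "messages": [
            {
                "role": "user",
                "content": (
                    "How many items are there in the image? "
                    "Output the thinking process in <think> </think> and\n "
                    "final answer (number) in <answer> </answer> tags."
                ),
            }
        ],
        "solution": "<answer> 3 </answer>",
    },
]

# Write one JSON object per line (JSONL), a common layout for such datasets.
with open("grpo_clevr.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```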
docs/source/Instruction/GRPO.md

Lines changed: 6 additions & 2 deletions
@@ -10,7 +10,11 @@ pip install math_verify # reward function
 pip install -U trl
 ```
 
-**Note**: It is normal for the loss to approach 0 during training; see this [issue](https://github.com/huggingface/open-r1/issues/239#issuecomment-2646297851).
+**FAQ**
+1. It is normal for the loss to approach 0 during training; see this [issue](https://github.com/huggingface/open-r1/issues/239#issuecomment-2646297851).
+2. How are the training steps calculated? See this [issue](https://github.com/modelscope/ms-swift/issues/3912).
+3. Why is clip_ratio always 1? See this [issue](https://github.com/huggingface/open-r1/issues/239#issuecomment-2646297851).
+
 
 ## Cluster Support
 
@@ -112,7 +116,7 @@ A conversation between User and Assistant. The user asks a question, and the Ass
 
 ## Arguments and Execution Script
 Arguments
-- num_generations: The number of samples per prompt, the G value in the paper; it must be divisible by per_device_eval_batch_size * nproc_per_node.
+- num_generations: The number of samples per prompt, the G value in the paper; it must be divisible by per_device_batch_size * nproc_per_node.
 - max_completion_length: The maximum length of sampled generations; default is 512.
 - ds3_gather_for_generation: This parameter applies to DeepSpeed ZeRO-3. If enabled, the policy model weights are gathered for generation, which improves generation speed. However, disabling it allows training models that exceed a single GPU's VRAM, at the cost of slower generation. Disabling this option is not compatible with vLLM generation. Default is True.
 - reward_funcs: Reward functions that score the model's generations; four rule-based functions (accuracy, format, cosine and repetition) are built in, see swift/plugin/orm.py for details.

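For reference (not part of this commit), a minimal sketch of the divisibility constraint that the corrected num_generations line describes; the concrete numbers are made-up examples chosen only to illustrate the check:

```python
# Hypothetical configuration values, for illustration only.
num_generations = 8          # G in the GRPO paper: samples per prompt
per_device_batch_size = 4
nproc_per_node = 2

# The doc states num_generations must be divisible by
# per_device_batch_size * nproc_per_node.
group = per_device_batch_size * nproc_per_node  # 8
assert num_generations % group == 0, (
    f"num_generations={num_generations} is not divisible by "
    f"per_device_batch_size * nproc_per_node = {group}"
)
print(f"OK: {num_generations} samples per prompt split into "
      f"{num_generations // group} pass(es) of {group} completions.")
```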
docs/source_en/BestPractices/GRPO-Multi-Modal-Training.md

Lines changed: 9 additions & 3 deletions
@@ -40,10 +40,16 @@ The purpose of redefining the dataset preprocessor here is to modify the query.
 
 ```json
 {
-    'images': [{'bytes': b'\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x01\xe0\x00\x00\x01@\x08\x06\x00\x00\x00d\xc8\xafB`\x82 ...', 'path': 'CLEVR_trainA_000000.png'}],
-    'messages': [{'role': 'user', 'content': 'How many items are there in the image? Output the thinking process in <think> </think> and\n final answer (number) in <answer> </answer> tags.'}],
-    'solution': '<answer> 3 </answer>'
+    "images": ["image_path1", "image_path2"],
+    "messages": [
+        {
+            "role": "user",
+            "content": "How many items are there in the image? Output the thinking process in <think> </think> and\n final answer (number) in <answer> </answer> tags."
+        }
+    ],
+    "solution": "<answer> 3 </answer>"
 }
+
 ```
 
 ---

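As a side note (not part of this commit), a toy sketch of an accuracy-style rule-based reward in the spirit of the built-in functions referenced in swift/plugin/orm.py; the regex and scoring below are illustrative assumptions, not the actual built-in implementation:

```python
import re

def toy_accuracy_reward(completion: str, solution: str) -> float:
    """Toy reward: 1.0 if the <answer> in the completion matches the
    <answer> in the dataset's solution field, else 0.0.

    Illustrative sketch only; the built-in rewards in swift/plugin/orm.py differ.
    """
    pattern = r"<answer>\s*(.*?)\s*</answer>"
    pred = re.search(pattern, completion, re.DOTALL)
    gold = re.search(pattern, solution, re.DOTALL)
    if pred is None or gold is None:
        return 0.0
    return 1.0 if pred.group(1).strip() == gold.group(1).strip() else 0.0

# Example: a completion that follows the <think>/<answer> format from the prompt.
completion = "<think> I count three objects. </think> <answer> 3 </answer>"
print(toy_accuracy_reward(completion, "<answer> 3 </answer>"))  # 1.0
```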
docs/source_en/Instruction/GRPO.md

Lines changed: 7 additions & 2 deletions
@@ -11,7 +11,12 @@ pip install math_verify # reward function
 pip install -U trl
 ```
 
-**Note**: It is normal for the loss to approach zero during training. Refer to this [issue](https://github.com/huggingface/open-r1/issues/239#issuecomment-2646297851) for more details.
+**FAQ**
+1. It is normal for the loss to approach zero during training. Refer to this [issue](https://github.com/huggingface/open-r1/issues/239#issuecomment-2646297851) for more details.
+2. How to calculate the training steps? Refer to this [issue](https://github.com/modelscope/ms-swift/issues/3912) for more details.
+3. Why is the clip_ratio always 1? Refer to this [issue](https://github.com/huggingface/open-r1/issues/239#issuecomment-2646297851) for more details.
+
+
 
 ## Cluster Support
 
@@ -115,7 +120,7 @@ In addition to rule-based reward functions, this framework also supports using r
 ## Arguments and Execution Script
 Arguments
 
-- num_generations: The number of samples for each prompt, referred to as the G value in the paper, needs to be divisible by per_device_eval_batch_size * nproc_per_node.
+- num_generations: The number of samples for each prompt, referred to as the G value in the paper, needs to be divisible by per_device_batch_size * nproc_per_node.
 - max_completion_length: The maximum length for sampling generation, default is 512.
 - ds3_gather_for_generation: This parameter applies to DeepSpeed ZeRO-3. If enabled, the policy model weights are gathered for generation, improving generation speed. However, disabling this option allows training models that exceed the VRAM capacity of a single GPU, albeit at the cost of slower generation. Disabling this option is not compatible with vLLM generation. The default is True.
 - reward_funcs: Reward functions to score the results generated by the model. Includes built-in accuracy, format, cosine and repetition rule-based functions, detailed in the swift/plugin/orm.py file.

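For reference only (not part of this commit), a rough sketch of one way to estimate the training step count asked about in FAQ item 2; the relation below is an assumption for a plain single-iteration setup with made-up numbers, and the linked ms-swift issue remains the authoritative answer:

```python
import math

# Hypothetical values, for illustration only.
dataset_size = 10000            # number of prompts in the training set
num_generations = 8             # completions sampled per prompt (G)
per_device_batch_size = 4       # completions per device per forward pass
nproc_per_node = 2              # number of training processes
gradient_accumulation_steps = 4
num_train_epochs = 1

# Assumed relation: each optimizer step consumes
# per_device_batch_size * nproc_per_node * gradient_accumulation_steps completions,
# and one epoch produces dataset_size * num_generations completions.
completions_per_epoch = dataset_size * num_generations
completions_per_step = (per_device_batch_size * nproc_per_node
                        * gradient_accumulation_steps)
steps = math.ceil(completions_per_epoch / completions_per_step) * num_train_epochs
print(f"Estimated training steps: {steps}")  # 2500 with the values above
```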