**docs/sphinx_doc/source/tutorial/example_reasoning_basic.md** (4 additions, 3 deletions)
````diff
@@ -77,6 +77,7 @@ buffer:
       response_key: 'answer'
     rollout_args:
       temperature: 1.0
+    default_workflow_type: 'math_workflow'
   eval_tasksets:
   - name: gsm8k-eval
     storage_type: file
@@ -86,15 +87,15 @@ buffer:
       format:
         prompt_key: 'question'
         response_key: 'answer'
-  default_workflow_type: 'math_workflow'
+      default_workflow_type: 'math_workflow'
   trainer_input:
     experience_buffer:
       name: gsm8k_buffer
       storage_type: queue
       path: 'sqlite:///gsm8k.db'
 explorer:
   eval_interval: 50
-  runner_num: 16
+  runner_per_model: 16
   rollout_model:
     engine_num: 1
 synchronizer:
@@ -117,7 +118,7 @@ trinity run --config examples/grpo_gsm8k/gsm8k.yaml
 
 ## Optional: RFT with SFT Warmup
 
-Before RFT, we may use SFT as a warmup step. Trinity-RFT supports adding SFT warmup stage before RFT by setting `stages` in the config file. The `sft_warmup_dataset` specifies the dataset used for SFT warmup, and `sft_warmup_steps` specifies the number of training steps for SFT warmup.
+Before RFT, we may use SFT as a warmup step. Trinity-RFT supports adding SFT warmup stage before RFT by setting `stages` in the config file. The `experience_buffer` specifies the dataset used for SFT warmup, and `total_steps` specifies the number of training steps for SFT warmup.
 
 ```yaml
 # Properly add the following configs in gsm8k.yaml
````
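Based only on the field names the updated paragraph mentions (`stages`, `experience_buffer`, `total_steps`), an SFT warmup stage entry might look roughly like the sketch below. Every field and nesting level other than those three names is an assumption for illustration, not the confirmed Trinity-RFT schema; consult the actual `gsm8k.yaml` example for the authoritative layout.

```yaml
# Hypothetical sketch only -- placement and fields other than `stages`,
# `experience_buffer`, and `total_steps` are assumed, not confirmed.
stages:
  - stage_name: sft_warmup
    algorithm:
      algorithm_type: sft
    buffer:
      total_steps: 100            # number of SFT warmup training steps
      trainer_input:
        experience_buffer:        # dataset used for SFT warmup
          name: sft_warmup_dataset
          storage_type: file
          path: /path/to/sft/dataset
```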
**docs/sphinx_doc/source/tutorial/trinity_configs.md** (10 additions, 6 deletions)
````diff
@@ -200,6 +200,7 @@ buffer:
   batch_size: 32
   train_batch_size: 256
   total_epochs: 100
+  total_steps: null
 
   explorer_input:
     taskset:
@@ -214,9 +215,6 @@ buffer:
       ...
     buffer_2:
       ...
-
-  default_workflow_type: 'math_workflow'
-  default_reward_fn_type: 'countdown_reward'
 ```
 
 - `batch_size`: Number of tasks used per training step. *Please do not multiply this value by the `algorithm.repeat_times` manually*.
@@ -231,6 +229,9 @@ Defines the dataset(s) used by the explorer for training and evaluation.
 ```yaml
 buffer:
   explorer_input:
+    default_workflow_type: 'math_workflow'
+    default_eval_workflow_type: 'math_workflow'
+    default_reward_fn_type: 'countdown_reward'
     taskset:
       name: countdown_train
       storage_type: file
@@ -262,7 +263,10 @@ buffer:
 ```
 
 - `buffer.explorer_input.taskset`: Task dataset used for training exploration policies.
-- `buffer.explorer_input.eval_taskset`: List of task datasets used for evaluation.
+- `buffer.explorer_input.eval_tasksets`: List of task datasets used for evaluation.
+- `buffer.explorer_input.default_workflow_type`: Default workflow type for all task datasets under `explorer_input` if not specified at the dataset level.
+- `buffer.explorer_input.default_eval_workflow_type`: Default evaluation workflow type for all eval task datasets under `explorer_input` if not specified at the dataset level.
+- `buffer.explorer_input.default_reward_fn_type`: Default reward function type for all task datasets under `explorer_input` if not specified at the dataset level.
 
 The configuration for each task dataset is defined as follows:
 
@@ -413,7 +417,7 @@ trainer:
   save_strategy: "unrestricted"
   grad_clip: 1.0
   use_dynamic_bsz: true
-  ppo_max_token_len_per_gpu: 16384
+  max_token_len_per_gpu: 16384
   ulysses_sequence_parallel_size: 1
   trainer_config: null
 ```
@@ -429,7 +433,7 @@ trainer:
 - `unrestricted`: No restrictions on saving operations; multiple nodes, processes, or threads are allowed to save the model simultaneously.
 - `grad_clip`: Gradient clipping for updates.
 - `use_dynamic_bsz`: Whether to use dynamic batch size.
-- `ppo_max_token_len_per_gpu`: The maximum number of tokens to be processed in forward and backward when updating the policy. Effective when `use_dynamic_bsz=true`.
+- `max_token_len_per_gpu`: The maximum number of tokens to be processed in forward and backward when updating the policy. Effective when `use_dynamic_bsz=true`.
````