`docs/sphinx_doc/source/tutorial/trinity_configs.md`: 22 additions & 5 deletions
@@ -3,6 +3,18 @@
 The following is the main config file for Trinity-RFT. Take `countdown.yaml` as an example.
 
 
+## Monitor
+
+```yaml
+monitor:
+  project: "Trinity-RFT-countdown"
+  name: "qwen2.5-1.5B-countdown"
+```
+
+- `monitor.project`: The project name. It must be set manually.
+- `monitor.name`: The name of the experiment. It must be set manually.
+
+
 ## Monitor
 
 ```yaml
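Both monitor fields are free-form labels: keeping `monitor.project` fixed while varying `monitor.name` per run groups related experiments together in the monitoring backend. A minimal sketch reusing the values above with an illustrative, made-up run name:

```yaml
monitor:
  project: "Trinity-RFT-countdown"      # shared across related countdown runs
  name: "qwen2.5-1.5B-countdown-run2"   # made-up name; set a unique one per experiment
```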
@@ -33,7 +45,7 @@ data:
   max_retry_times: 3
   max_retry_interval: 1
 
-  total_epoch: 20
+  total_epochs: 20
   batch_size: 96
   default_workflow_type: 'math_workflow'
   default_reward_fn_type: 'countdown_reward'
@@ -47,7 +59,7 @@ data:
 - `data.db_url`: The URL of the database.
 - `data.max_retry_times`: The maximum number of retries when loading the dataset from database.
 - `data.max_retry_interval`: The maximum interval between retries when loading the dataset from database.
-- `data.total_epoch`: The total number of epochs to explore the dataset. Default is `1`. It should be set manually.
+- `data.total_epochs`: The total number of epochs to explore the dataset. Default is `1`. It should be set manually.
 - `data.batch_size`: The number of `Task` in one training batch. The real batch size used in training is `data.batch_size` * `actor_rollout_ref.rollout.n` Default is `1`. It should be set manually.
 - `data.default_workflow_type`: The default workflow type used for training.
 - `data.default_reward_fn_type`: The default reward function type used for training.
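Because the real training batch size is `data.batch_size` multiplied by `actor_rollout_ref.rollout.n`, it is worth sanity-checking the product before launching. A small sketch using the `batch_size: 96` from this file together with an assumed `rollout.n` of 5 (that value is not part of this config):

```yaml
data:
  total_epochs: 20   # the dataset is explored 20 times in total
  batch_size: 96     # number of Task per training batch
# with an assumed actor_rollout_ref.rollout.n of 5:
#   real batch size = 96 * 5 = 480 samples per training step
```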
@@ -345,10 +357,14 @@ algorithm:
   gamma: 1.0
   lam: 1.0
   adv_estimator: gae
+  norm_adv_by_std_in_grpo: True
+  use_kl_in_reward: False
   kl_penalty: kl # how to estimate kl divergence
   kl_ctrl:
     type: fixed
     kl_coef: 0.001
+    horizon: 10000
+    target_kl: 0.1
 
 trainer:
   balance_batch: True
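`horizon` and `target_kl` are added under `kl_ctrl` even though `type` stays `fixed`, where only `kl_coef` should matter; they presumably come into play when an adaptive controller is selected. A hedged sketch of that reading (the `adaptive` type is an assumption borrowed from common PPO-style KL controllers, not something this file documents):

```yaml
algorithm:
  use_kl_in_reward: False   # KL is not folded into the reward here
  kl_ctrl:
    type: fixed             # constant KL coefficient
    kl_coef: 0.001
    # assumed alternative: adapt kl_coef toward target_kl over roughly `horizon` samples
    # type: adaptive
    # target_kl: 0.1
    # horizon: 10000
```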
@@ -363,7 +379,7 @@ trainer:
   save_freq: 100
   # auto: find the last ckpt to resume. If can't find, start from scratch
   resume_mode: auto # or auto or resume_path if
-  resume_from_path: False
+  resume_from_path: ""
   test_freq: 100
   critic_warmup: 0
   default_hdfs_dir: null
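With this change `resume_from_path` is an empty string rather than the boolean `False`, which matches its role as a path. Going by the inline comment, a run can also resume from an explicit checkpoint; a hedged sketch (the path below is purely illustrative):

```yaml
trainer:
  resume_mode: resume_path                                   # pick an explicit checkpoint
  resume_from_path: "/path/to/checkpoints/global_step_100"   # illustrative path, not a real default
```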
@@ -383,8 +399,9 @@ trainer:
 - `actor_rollout_ref.actor.grad_clip`: Gradient clip for actor model training.
 - `actor_rollout_ref.actor.clip_ratio`: Used for compute policy loss.
 - `actor_rollout_ref.actor.entropy_coeff`: Used for compute policy loss.
-- `actor_rollout_ref.actor.use_kl_loss`: True for GRPO.
-- `actor_rollout_ref.actor.kl_loss_coef`: Used for GRPO, optional value is `kl`, `abs`, `mse` or `low_var_kl`.
+- `actor_rollout_ref.actor.use_kl_loss`: Whether to enable KL loss.
+- `actor_rollout_ref.actor.kl_loss_coef`: The coefficient of the KL loss.
+- `actor_rollout_ref.actor.kl_loss_type`: How to compute the KL loss; optional values are `kl`, `abs`, `mse` or `low_var_kl`.
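The old wording tied `use_kl_loss` and `kl_loss_coef` to GRPO; the new wording is generic. For a GRPO-style run, the earlier hint suggests a combination like the one below (the coefficient and the `low_var_kl` choice are assumptions for illustration, not values this document prescribes):

```yaml
actor_rollout_ref:
  actor:
    use_kl_loss: True        # GRPO-style: apply KL as a loss term
    kl_loss_coef: 0.001      # assumed coefficient; tune per task
    kl_loss_type: low_var_kl # one of: kl, abs, mse, low_var_kl
algorithm:
  use_kl_in_reward: False    # and keep KL out of the reward
```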