
Commit 0ed548e

Merge verl-related config into default config (#256)
1 parent f79e203 commit 0ed548e


104 files changed (+671, -1739 lines)


benchmark/config/countdown-template.yaml

Lines changed: 1 addition & 2 deletions
@@ -47,11 +47,10 @@ buffer:
   decay: 0.1
   sft_warmup_steps: 0
 explorer:
-  runner_num: 32
+  runner_per_model: 8
   max_timeout: 900
   max_retry_times: 2
   rollout_model:
-    engine_type: vllm_async
     engine_num: 2
     tensor_parallel_size: 1
     use_v1: true

benchmark/config/gsm8k-template.yaml

Lines changed: 0 additions & 1 deletion
@@ -56,7 +56,6 @@ explorer:
   max_timeout: 900
   max_retry_times: 2
   rollout_model:
-    engine_type: vllm_async
     engine_num: 2
     tensor_parallel_size: 1
     use_v1: true

docs/sphinx_doc/source/tutorial/example_async_mode.md

Lines changed: 14 additions & 3 deletions
@@ -52,8 +52,6 @@ explorer:
 synchronizer:
   sync_method: 'checkpoint'
   sync_interval: 10
-trainer:
-  trainer_config_path: examples/async_gsm8k/verl_config.yaml
 ```

 Key configurations in `trainer.yaml` are as follows:
@@ -95,7 +93,20 @@ synchronizer:
   sync_method: 'checkpoint'
   sync_interval: 10
 trainer:
-  trainer_config_path: examples/async_gsm8k/verl_config.yaml
+  trainer_config:
+    actor_rollout_ref:
+      model:
+        use_remove_padding: true
+      actor:
+        use_dynamic_bsz: true
+        ppo_max_token_len_per_gpu: 16384
+        ulysses_sequence_parallel_size: 1
+        optim:
+          lr: 1e-6
+      ref:
+        log_prob_use_dynamic_bsz: ${trainer.trainer_config.actor_rollout_ref.actor.use_dynamic_bsz}
+        log_prob_max_token_len_per_gpu: ${trainer.trainer_config.actor_rollout_ref.actor.ppo_max_token_len_per_gpu}
+        ulysses_sequence_parallel_size: ${trainer.trainer_config.actor_rollout_ref.actor.ulysses_sequence_parallel_size} # sp size
 ```

 You can run this example with the following command:

docs/sphinx_doc/source/tutorial/example_dpo.md

Lines changed: 5 additions & 3 deletions
@@ -50,7 +50,7 @@ For SFT, we download the `open-r1/Mixture-of-Thoughts` dataset to the local dire

 ### Configuration for DPO

-We use the configurations in [`dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/dpo.yaml) and [`train_dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/train_dpo.yaml) for this experiment. Some important setups are listed in the following:
+We use the configurations in [`dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/dpo.yaml) for this experiment. Some important setups are listed in the following:

 We run the experiment in train mode, as there is no Explorer. To enable this mode, we set `mode` to `train` and pass the data path to the trainer.

@@ -83,8 +83,9 @@ buffer:
       chosen_key: chosen
       rejected_key: rejected
 trainer:
-  trainer_config_path: 'examples/dpo_humanlike/train_dpo.yaml'
   save_interval: 30
+  trainer_config:
+    ... # omitted here for simplicity
 ```

 `buffer.trainer_input.experience_buffer` specifies the dataset to be used for training, including its name, storage type, path, and format.
@@ -129,8 +130,9 @@ buffer:
       prompt_type: messages
       messages_key: messages
 trainer:
-  trainer_config_path: /PATH/TO/TRAIN_CONFIG_YAML/
   save_interval: 50
+  trainer_config:
+    ... # omitted here for simplicity
 ```

 Here we set `buffer.trainer_input.experience_buffer.format.prompt_type` to `messages` because the source data is in message format. We also set `buffer.trainer_input.experience_buffer.format.messages_key` to `messages` to specify the key in the dataset that contains the messages.

docs/sphinx_doc/source/tutorial/example_mix_algo.md

Lines changed: 1 addition & 1 deletion
@@ -254,7 +254,7 @@ class MIXPolicyLossFn(PolicyLossFn):

 With the above newly-defined classes and functions, we can run the experiments without modifying other processes.
 An example showing some important configurations is shown below, including the weighting factor $\mu$ as `algorithm.policy_loss_fn_args['mu']` and the batch size of expert experiences $B'$, calculated as the product of `buffer.batch_size`, `algorithm.sample_strategy_args['expert_data_ratio']` and `algorithm.repeat_times`.
-For the full configuration, please refer to [`mix_math.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_math/mix_math.yaml) and [`train_mix_math.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_math/train_mix_math.yaml).
+For the full configuration, please refer to [`mix_math.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/mix_math/mix_math.yaml).

 ```yaml
 algorithm:
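As a quick numeric illustration of the product that determines $B'$ in the hunk above (the specific values below are hypothetical, chosen only to make the arithmetic concrete):

```yaml
buffer:
  batch_size: 96
algorithm:
  repeat_times: 8
  sample_strategy_args:
    expert_data_ratio: 0.25
# B' = 96 * 0.25 * 8 = 192 expert experiences per training step
```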

docs/sphinx_doc/source/tutorial/example_reasoning_advanced.md

Lines changed: 1 addition & 2 deletions
@@ -12,15 +12,14 @@ Let's continue with the [previous GSM8k example](./example_reasoning_basic.md) a

 As an experimental feature of Trinity-RFT, we develop an embarrassingly simple off-policy RL algorithm, termed OPMD (Online Policy Mirror Descent, inspired by [Kimi k1.5](https://arxiv.org/abs/2501.12599)).
 The algorithm design and analysis can be found in this [technical report](../../assets/opmd.pdf).
-The config files are [`opmd_gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/blob/main/examples/opmd_gsm8k/opmd_gsm8k.yaml) and [`train_opmd_gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/blob/main/examples/opmd_gsm8k/train_opmd_gsm8k.yaml).
+The config file is [`opmd_gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/blob/main/examples/opmd_gsm8k/opmd_gsm8k.yaml).

 To try out the OPMD algorithm:
 ```shell
 trinity run --config examples/opmd_gsm8k/opmd_gsm8k.yaml
 ```

 Note that in this config file, `sync_interval` is set to 10, i.e., the model weights of the explorer and trainer are synchronized only once every 10 training steps, which leads to a challenging off-policy scenario (potentially with abrupt distribution shift during the RFT process).
-Other configurations of particular interest are explained at the beginning of [`train_opmd_gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/opmd_gsm8k/train_opmd_gsm8k.yaml).


docs/sphinx_doc/source/tutorial/example_reasoning_basic.md

Lines changed: 6 additions & 3 deletions
@@ -100,7 +100,7 @@ We run the experiment in a synchronous mode where the Explorer and Trainer opera

 ### Use GRPO Algorithm

-We use the configurations in [`gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k/gsm8k.yaml) and [`train_gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k/train_gsm8k.yaml) for this experiment. Some important setups of `gsm8k.yaml` are listed in the following:
+We use the configurations in [`gsm8k.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_gsm8k/gsm8k.yaml) for this experiment. Some important setups of `gsm8k.yaml` are listed in the following:


 ```yaml
@@ -155,9 +155,12 @@ synchronizer:
   sync_method: 'nccl'
   sync_interval: 1
 trainer:
-  trainer_config_path: 'examples/grpo_gsm8k/train_gsm8k.yaml'
   save_interval: 100
-
+  trainer_config:
+    actor_rollout_ref:
+      actor:
+        optim:
+          lr: 1e-5
 ```


docs/sphinx_doc/source/tutorial/example_search_email.md

Lines changed: 1 addition & 1 deletion
@@ -34,7 +34,7 @@ If you want to choose a new database path, you can modify the `DEFAULT_DB_PATH`

 ### Step 2: Run the Workflow

-The config files are located in [`email_search.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_email_search/email_search.yaml) and [`train_email_search.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_email_search/train_email_search.yaml).
+The config file is located in [`email_search.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/grpo_email_search/email_search.yaml).
 To run this example, you can run the following command:

 ```bash

docs/sphinx_doc/source/tutorial/example_step_wise.md

Lines changed: 14 additions & 2 deletions
@@ -148,9 +148,21 @@ synchronizer:
   sync_interval: 2
   sync_timeout: 3600
 trainer:
-  trainer_type: 'verl'
-  trainer_config_path: 'examples/grpo_alfworld_general_multi_step/train_alfworld.yaml'
   save_interval: 50
+  trainer_config:
+    actor_rollout_ref:
+      model:
+        use_remove_padding: true
+      actor:
+        use_dynamic_bsz: true
+        ppo_max_token_len_per_gpu: 16384
+        ulysses_sequence_parallel_size: 1
+        optim:
+          lr: 5e-6
+      ref:
+        log_prob_use_dynamic_bsz: ${trainer.trainer_config.actor_rollout_ref.actor.use_dynamic_bsz}
+        log_prob_max_token_len_per_gpu: ${trainer.trainer_config.actor_rollout_ref.actor.ppo_max_token_len_per_gpu}
+        ulysses_sequence_parallel_size: ${trainer.trainer_config.actor_rollout_ref.actor.ulysses_sequence_parallel_size} # sp size
 ```


docs/sphinx_doc/source/tutorial/faq.md

Lines changed: 7 additions & 9 deletions
@@ -1,23 +1,21 @@
 # FAQ

 ## Part 1: Configurations
-**Q:** Why do most examples have two configuration YAML files, e.g., `gsm8k.yaml` and `train_gsm8k.yaml` in the `examples/grpo_gsm8k` directory?
+**Q:** How do I configure the parameters?

-**A:** Trinity-RFT uses [veRL](https://github.com/volcengine/verl) as the training backend, and the auxiliary YAML file starting with `train_` is used for configuring veRL, referred to [veRL documentation](https://verl.readthedocs.io/en/latest/examples/config.html).
-If you specify the path to `train_gsm8k.yaml` in `trainer.trainer_config_path`, Trinity-RFT will automatically pass the parameters to veRL.
+**A:** You can use the config manager to configure the parameters by running `trinity studio --port 8080`. This approach provides a convenient way to configure the parameters.

-We provide an alternative way to configure the veRL trainer. You may also directly specify the parameters in the `trainer.trainer_config` dictionary. This approach is mutually exclusive with using `trainer.trainer_config_path`.
-
-Note that some parameters are not listed in the auxiliary configuration file (e.g., `train_gsm8k.yaml`), as they will be overridden by the parameters in the trinity configuration file (e.g., `gsm8k.yaml`). Please refer to `./trinity_configs.md` for more details.
-For users' convenience, future versions will gradually reduce parameters in `trainer.trainer_config` and `trainer.trainer_config_path` until it's fully deprecated.
+Advanced users can also edit the config file directly.
+Trinity-RFT uses [veRL](https://github.com/volcengine/verl) as the training backend, which can have a massive number of parameters; refer to the [veRL documentation](https://verl.readthedocs.io/en/latest/examples/config.html). You may specify these parameters in two ways: (1) specify them directly in the `trainer.trainer_config` dictionary; (2) specify them in an auxiliary YAML file starting with `train_` and pass its path (e.g., to `train_gsm8k.yaml`) in `trainer.trainer_config_path`. These two ways are mutually exclusive.

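For illustration, here is a minimal sketch of the two mutually exclusive options described above; the concrete values (`save_interval: 100`, `lr: 1e-5`, the `train_gsm8k.yaml` path) are borrowed from the GRPO GSM8k example elsewhere in this commit and are examples, not defaults:

```yaml
# Option 1: specify veRL parameters inline under trainer.trainer_config
trainer:
  save_interval: 100
  trainer_config:
    actor_rollout_ref:
      actor:
        optim:
          lr: 1e-5  # example value

# Option 2: keep veRL parameters in an auxiliary YAML file instead
# (mutually exclusive with trainer_config above)
# trainer:
#   save_interval: 100
#   trainer_config_path: 'examples/grpo_gsm8k/train_gsm8k.yaml'
```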
---
1512

16-
**Q:** What's the relationship between `buffer.batch_size`, `actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu` and other batch sizes?
13+
**Q:** What's the relationship between `buffer.batch_size`, `buffer.train_batch_size`, `actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu` and other batch sizes?
1714

1815
**A:** The following parameters are closely related:
1916

20-
- `buffer.batch_size`: The number of tasks in a batch, effective for both the explorer and the trainer.
17+
- `buffer.batch_size`: The number of tasks in a batch, effective for the explorer.
18+
- `buffer.train_batch_size`: The number of experiences in a mini-batch, effective for the trainer. If not specified, it defaults to `buffer.batch_size` * `algorithm.repeat_times`.
2119
- `actor_rollout_ref.actor.ppo_mini_batch_size`: The number of experiences in a mini-batch, overridden by `buffer.train_batch_size`; but in the `update_policy` function, its value becomes the number of experiences in a mini-batch per GPU, i.e., `buffer.train_batch_size (/ ngpus_trainer)`. The expression of dividing `ngpus_trainer` is caused by implict data allocation to GPUs, but this do not affects the result after gradient accumulation.
2220
- `actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu`: The number of experiences in a micro-batch per GPU.
2321
