
Commit cf7cbad

Merge branch 'main' into dev/async

2 parents: 241d8bd + 50aba82


65 files changed: +1129 −395 lines

.github/workflows/docker/docker-compose.yaml

4 additions, 2 deletions

````diff
@@ -8,7 +8,8 @@ services:
       - RAY_ADDRESS=auto
       - CHECKPOINT_ROOT_DIR=/mnt/checkpoints
       - DATA_ROOT_DIR=/mnt/data
-      - MODEL_PATH=/mnt/checkpoints/Qwen2.5-1.5B-Instruct
+      - MODEL_PATH=/mnt/models/Qwen3-1.7B
+      - CHECKPOINT_PATH=/mnt/checkpoints
     working_dir: /workspace
     networks:
       - trinity-network
@@ -32,7 +33,8 @@ services:
       - HF_ENDPOINT=https://hf-mirror.com
       - CHECKPOINT_ROOT_DIR=/mnt/checkpoints
       - DATA_ROOT_DIR=/mnt/data
-      - MODEL_PATH=/mnt/checkpoints/Qwen2.5-1.5B-Instruct
+      - MODEL_PATH=/mnt/models/Qwen3-1.7B
+      - CHECKPOINT_PATH=/mnt/checkpoints
     working_dir: /workspace
     volumes:
       - trinity-volume:/mnt
````
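The net effect on each service is easy to miss in the delta: the model now lives under `/mnt/models` while `CHECKPOINT_ROOT_DIR` stays in place and a new `CHECKPOINT_PATH` points at the checkpoint mount. Below is a rough sketch of one service's resulting environment block; the service name `trinity-node-1` is only assumed here (borrowed from the unittest workflow), and other keys are elided.

```yaml
# Sketch of the post-change environment block, not the full compose file.
# Service name is assumed from the unittest workflow; other settings elided.
services:
  trinity-node-1:
    environment:
      - RAY_ADDRESS=auto
      - CHECKPOINT_ROOT_DIR=/mnt/checkpoints
      - DATA_ROOT_DIR=/mnt/data
      - MODEL_PATH=/mnt/models/Qwen3-1.7B   # model weights moved under /mnt/models
      - CHECKPOINT_PATH=/mnt/checkpoints    # new: explicit checkpoint location
    working_dir: /workspace
```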

.github/workflows/unittest.yaml

1 addition, 1 deletion

````diff
@@ -36,7 +36,7 @@ jobs:
       - name: Run unittest
         working-directory: trinity-${{ github.run_id }}/.github/workflows/docker
         run: |
-          docker compose exec trinity-node-1 pytest tests --ignore=tests/data --ctrf report.json
+          docker compose exec trinity-node-1 pytest tests -v -s --ignore=tests/data --ctrf report.json

       - name: Upload test results
         uses: actions/upload-artifact@v4
````

.gitignore

3 additions, 0 deletions

````diff
@@ -94,3 +94,6 @@ modules.rst

 # wandb
 wandb/
+
+# checkpoints
+checkpoints/
````

docs/sphinx_doc/source/conf.py

0 additions, 1 deletion

````diff
@@ -40,7 +40,6 @@

 templates_path = ["_templates"]
 exclude_patterns = ["build"]
-autodoc_mock_imports = ["ray"]

 autodoc_default_options = {
     "members": True,
````

docs/sphinx_doc/source/tutorial/example_data_functionalities.md

1 addition, 1 deletion

````diff
@@ -244,7 +244,7 @@ You can set more config items for this OP (e.g. notification when annotation is

 When you start running with the RFT config, the data module will start the OP `human_preference_annotation_mapper`, and then you can find a new project on the "Projects" page of the label-studio server.

-![]("../../assets/data-projects.png")
+![](../../assets/data-projects.png)

 You can click and enter into this project, and all the samples that need to be annotated are listed on the page.

````

docs/sphinx_doc/source/tutorial/example_dpo.md

2 additions, 3 deletions

````diff
@@ -40,13 +40,13 @@ Note that the dataset has the keys `prompt`, `chosen` and `rejected`. If not, pa

 We use the configurations in [`dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/dpo.yaml) and [`train_dpo.yaml`](https://github.com/modelscope/Trinity-RFT/tree/main/examples/dpo_humanlike/train_dpo.yaml) for this experiment. Some important setups are listed in the following:

-We run the experiment in a train mode, as there is no Explorer. To enable this mode, we config `mode` to `train` and set `sync_method` to `offline`. The value of `sync_iteration_interval` can be set as same of the value of `save_freq`.
+We run the experiment in a train mode, as there is no Explorer. To enable this mode, we config `mode` to `train` and set `sync_method` to `checkpoint`. The value of `sync_iteration_interval` can be set as same of the value of `save_interval`.

 ```yaml
 # In dpo.yaml
 mode: train
 synchronizer:
-  sync_method: 'offline'
+  sync_method: 'checkpoint'
 buffer:
   train_dataset:
     storage_type: file
@@ -63,7 +63,6 @@ trainer:
 # In train_dpo.yaml
 actor_rollout_ref:
   actor:
-    alg_type: dpo
     use_kl_loss: True
     kl_loss_coef: 0.1 # value of beta in DPO
 ```
````
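This documentation edit pairs with the config changes elsewhere in the commit: in train mode there is no Explorer, so the trainer's saved checkpoints are the only channel for weight updates, and `sync_iteration_interval` should line up with how often they are written. A minimal sketch of that pairing, using the 30-step values from `examples/dpo_humanlike/dpo.yaml` in this same commit (other keys elided):

```yaml
# Sketch only: the keys relevant to checkpoint-based sync in train mode.
# Values taken from examples/dpo_humanlike/dpo.yaml in this commit.
mode: train
synchronizer:
  sync_method: 'checkpoint'     # reload trainer checkpoints instead of syncing over nccl
  sync_iteration_interval: 30   # keep equal to trainer.save_interval
  sync_timeout: 1200
trainer:
  algorithm_type: dpo
  save_interval: 30             # checkpoints written every 30 iterations
```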

docs/sphinx_doc/source/tutorial/example_multi_turn.md

2 additions, 2 deletions

````diff
@@ -122,5 +122,5 @@ and include them in the init files in `trinity/common/workflows/__init__.py`

 Then you are all set! It should be pretty simple😄, and both environments converge.

-![]("../../assets/alfworld_reward_curve.png")
-![]("../../assets/webshop_reward_curve.png")
+![](../../assets/alfworld_reward_curve.png)
+![](../../assets/webshop_reward_curve.png)
````

docs/sphinx_doc/source/tutorial/example_reasoning_basic.md

1 addition, 1 deletion

````diff
@@ -42,7 +42,7 @@ We run the experiment in a synchronous mode where the Explorer and Trainer opera
 ```yaml
 mode: both
 synchronizer:
-  sync_method: 'online'
+  sync_method: 'nccl'
   sync_iteration_interval: 2
 ```

````

docs/sphinx_doc/source/tutorial/trinity_configs.md

8 additions, 20 deletions

````diff
@@ -15,17 +15,6 @@ monitor:
 - `monitor.name`: The name of the experiment. It must be set manually.


-## Monitor
-
-```yaml
-monitor:
-  project: "Trinity-RFT-countdown"
-  name: "qwen2.5-1.5B-countdown"
-```
-
-- `monitor.project`: The project name. It must be set manually.
-- `monitor.name`: The name of the experiment. It must be set manually.
-
 ## Data

 <!-- The `data` configuration specifies the data used for training. It includes the total number of epochs, the batch size, the path to the dataset, the default workflow type, the default reward function type, and the format configuration. -->
@@ -131,8 +120,6 @@ explorer:
   enforce_eager: true
   dtype: bfloat16
   temperature: 1.0
-  top_p: 1.0
-  top_k: -1
   seed: 42
   logprobs: 0
   repeat_times: 5
@@ -150,8 +137,6 @@ explorer:
 - `explorer.enforce_eager`: Whether to enforce eager mode. Default is `True`.
 - `explorer.dtype`: The data type used in vLLM. Default is `bfloat16`.
 - `explorer.temperature`: The temperature used in vLLM. Default is `1.0`.
-- `explorer.top_p`: The top-p used in vLLM. Default is `1.0`.
-- `explorer.top_k`: The top-k used in vLLM. Default is `-1`.
 - `explorer.seed`: The seed used in vLLM. Default is `42`.
 - `explorer.logprobs`: The logprobs used in vLLM. Default is `0`.
 - `explorer.repeat_times`: The number of times to repeat each task, used for GRPO-like algorithms. Default is `5`.
@@ -164,12 +149,16 @@ explorer:

 ```yaml
 synchronizer:
-  sync_method: 'online'
+  sync_method: 'nccl'
   sync_iteration_interval: 10
+  sync_timeout: 1200
 ```

-- `synchronizer.sync_method`: The synchronization method, Support `online` and `offline`. Default is `online`.
+- `synchronizer.sync_method`: The synchronization method between `trainer` and `explorer`.
+Support `nccl` and `checkpoint`, `nccl` represents that model weights in `explorer` will be synchronized from `trainer` through `nccl`,
+`checkpoint` represents that `explorer` will load the newest checkpoints saved by `trainer` then update its model weights. Default is `nccl`.
 - `synchronizer.sync_iteration_interval`: The interval between two synchronizations. Default is `10`. It should be set manually.
+- `synchronizer.sync_timeout`: The timeout of the synchronization. Default is `1200`.

 ## Trainer

@@ -180,13 +169,15 @@ trainer:
   trainer_config_path: 'examples/ppo_countdown/train_countdown.yaml'
   sft_warmup_iteration: 0
   eval_interval: 1000
+  save_interval: 100
 ```

 - `trainer.trainer_type`: The backend of the trainer, Only `verl` is supported.
 - `trainer.algorithm_type`: The type of the algorithm, Support `ppo`, `grpo`, `opmd` and `dpo`.
 - `trainer.trainer_config_path`: The path to the trainer configuration file. It must be set manually.
 - `trainer.sft_warmup_iteration`: The number of iterations to warm up the model. Default is `0`.
 - `trainer.eval_interval`: The interval between two evaluations. Default is `1000`.
+- `trainer.save_interval`: The interval between two checkpoints. Default is `100`.

 ### veRL Trainer Configuration

@@ -249,7 +240,6 @@ actor_rollout_ref:
     optimizer_offload: False
     fsdp_size: -1
     # --- below: opmd ---
-    alg_type: ppo # ppo / opmd / pairwise_opmd
     tau: 0.000 # strength of regularization w.r.t. old / ref policy
     opmd_baseline: mean # mean / logavgexp, applicable to opmd
     use_uid: False # True / False, applicable to pairwise_opmd
@@ -403,7 +393,6 @@ trainer:
 - `actor_rollout_ref.actor.kl_loss_coef`: The coefficient of kl loss.
 - `actor_rollout_ref.actor.kl_loss_type`: How to compute kl loss, optional value is `kl`, `abs`, `mse` or `low_var_kl`.
 - `actor_rollout_ref.actor.ulysses_sequence_parallel_size`: Ulysses sequence parallel size.
-- `actor_rollout_ref.actor.alg_type`: Used for OPMD, optional value is `ppo`, `opmd` or `pairwise_opmd`.
 - `actor_rollout_ref.actor.tau`: strength of regularization w.r.t. old / ref policy.
 - `actor_rollout_ref.actor.opmd_baseline`: mean / logavgexp, applicable to opmd.
 - `actor_rollout_ref.actor.use_uid`: True / False, applicable to pairwise_opmd.
@@ -427,7 +416,6 @@ trainer:
 - `algorithm`: Training algorithm settings.

 - `trainer.balance_batch`: Whether to balance batch size between GPUs during training.
-- `trainer.save_freq`: Frequency of saving checkpoints.
 - `trainer.resume_mode`: Resume mode for training. Support `disable`, `auto` and `resume_path`.
 - `trainer.resume_from_path`: Path to resume from.
 - `trainer.critic_warmup`: The number of iteration to train the critic model before actual policy learning.
````
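Taken together, these documentation edits track a rename of the synchronizer options (`online` → `nccl`, `offline` → `checkpoint`), a new `sync_timeout` setting, and a `trainer.save_interval` that appears to take over checkpointing from the now-removed `trainer.save_freq` in the veRL section. A minimal sketch of the new-style keys side by side, using the defaults documented above (other keys elided):

```yaml
# Sketch of the renamed/added keys only; values are the documented defaults.
synchronizer:
  sync_method: 'nccl'           # 'nccl': trainer pushes weights; 'checkpoint': explorer reloads saved checkpoints
  sync_iteration_interval: 10   # iterations between two synchronizations
  sync_timeout: 1200            # timeout of a synchronization (default 1200)
trainer:
  save_interval: 100            # iterations between two checkpoints
```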

examples/dpo_humanlike/dpo.yaml

3 additions, 3 deletions

````diff
@@ -37,8 +37,6 @@ explorer:
   enforce_eager: true
   dtype: bfloat16
   temperature: 1.0
-  top_p: 1.0
-  top_k: -1
   seed: 42
   logprobs: 0
   repeat_times: 1 # NOTE
@@ -47,12 +45,14 @@ explorer:
   max_pending_requests: 32
   max_waiting_steps: 4
 synchronizer:
-  sync_method: 'offline'
+  sync_method: 'checkpoint'
   sync_iteration_interval: 30
+  sync_timeout: 1200
 trainer:
   trainer_type: 'verl'
   algorithm_type: dpo
   trainer_config_path: 'examples/dpo_humanlike/train_dpo.yaml'
+  save_interval: 30
 monitor:
   cache_root_dir: ""
   project: "dpo_example"
````
