
Commit 106f671

chenjiaoAngel and Shangwei-Li authored and committed
[fsdp] feat: add validate process on trainer node when use_trainer_do_validate=True (verl-project#4683)
### What does this PR do?

> Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

Use the trainer node to run the validate process when the run mode is fully async. This saves validate compute time and reduces the `perf/time_of_step` peak.

- Add a new `use_trainer_do_validate` option to the fully-async `async_training` config to decide whether the trainer node runs the validate process.
  - `use_trainer_do_validate`: default is `False`.
- It improves validate performance; for example, in `dapo_7b_math_fsdp2_8_8.sh` it brings about a 1x speedup.

<img width="1440" height="608" alt="image" src="https://github.com/user-attachments/assets/436e481e-4f51-4e8e-ad08-b038b3f0e89d" />
<img width="1030" height="762" alt="image" src="https://github.com/user-attachments/assets/ed8e3237-d37d-4eff-b944-fb81ea63f87c" />

- Optimized `process_validation_metrics()` in the `_validate()` process: with an input dataset of length 1444, its latency drops from 150+ s to roughly 40 s.

<img width="2630" height="448" alt="image" src="https://github.com/user-attachments/assets/b6fb50bc-5856-49c1-91dc-f845e9c410b4" />
<img width="2504" height="518" alt="image" src="https://github.com/user-attachments/assets/b3b5f238-0c5e-4c63-9683-83f34d5a46fd" />

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`
  - If this PR involves multiple modules, separate them with `,`, like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

- On test scripts such as `dapo_7b_math_fsdp2_8_8.sh`, add `async_training.use_trainer_do_validate=True` to the command to run the validate computation on the trainer node.
- Results of this feature on the Qwen2.5-Math-7B model:
  - The baseline script is `dapo_7b_math_fsdp2_8_8.sh`.
  - The optimized script is `dapo_7b_math_fsdp2_8_8.sh` + `async_training.use_trainer_do_validate=True`.
  - The accuracy and performance are shown below:

<img width="1650" height="702" alt="image" src="https://github.com/user-attachments/assets/3419d7bb-a64c-4fe9-b776-3312925f51ab" />
<img width="1580" height="522" alt="image" src="https://github.com/user-attachments/assets/2c3a7e24-7421-4f12-8527-7b997f9c3b89" />

  - Green: optimized case (`async_training.use_trainer_do_validate=True`)
  - Gray: baseline case (`async_training.use_trainer_do_validate=False`)

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
async_training.use_trainer_do_validate=True \
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.
### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [ ] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)

---------

Co-authored-by: Shangwei-Li <lishangwei@mail.ustc.edu.cn>
1 parent 938a073 commit 106f671

File tree

18 files changed: +1570 −58 lines changed


docs/advance/fully_async.md

Lines changed: 37 additions & 0 deletions
@@ -105,6 +105,7 @@ but the overlap in their time consumption reduces the end-to-end time consumption
 | `async_training.checkpoint_engine.enable` | Whether to use checkpoint_engine for accelerating, default `True` |
 | `async_training.checkpoint_engine.overlap_broadcast_and_consume` | When use checkpoint_engine, whether to overlap broadcast and load_weights, default `False` |
 | `async_training.checkpoint_engine.device_buffer_size_M` | When use checkpoint_engine, the user-specific bucket size (MB), default `4096` |
+| `async_training.use_trainer_do_validate` | Whether to use the trainer node to run the validate process, default `False` |
 
 **Further Explanation:**
 
@@ -194,6 +195,13 @@ but the overlap in their time consumption reduces the end-to-end time consumption
 - When disable `overlap_broadcast_and_consume`, the additional device memory overhead of
   trainer rank is `2 * bucket_size` and rollout rank is `1 * bucket_size`
 
+* `async_training.use_trainer_do_validate`
+
+  Controls whether the trainer's `do_validate` method is used for validation.
+  If set to `True`, the trainer performs validation after each parameter update, which reduces the validation
+  time overhead and the trainer-node idle time.
+  If set to `False`, the trainer does not perform validation.
+
 ### Supported Modes
 
 1. on policy pipeline:
@@ -477,6 +485,35 @@ We tested the single-step parameter synchronization time of the checkpoint-engine
 | Qwen3-235B-A22B | 64 | 64 | False | 58.57s |
 | Qwen3-235B-A22B | 64 | 64 | True | 23.70s |
 
+### use_trainer_do_validate Experiment
+
+We tested the effect of setting `use_trainer_do_validate=True` on the training process. The results show that
+enabling it reduces the validation time overhead and the trainer-node idle time.
+We used Qwen2.5-Math-7B to verify the benefits: validation time improved by about 2x, and trainer-node idle
+time dropped by about 40%.
+
+* Machine: H20
+* Model: Qwen2.5-Math-7B
+* Rollout length: max_response_length (FSDP2): 10K tokens
+* Algorithm: DAPO
+* Dataset: TRAIN_FILE: dapo-math-17k.parquet; TEST_FILE: aime-2024.parquet
+* Engine: vllm + FSDP2
+* rollout.n: 16
+* ppo_mini_batch_size: 32
+* test_freq: 10
+
+* fully_async_policy
+  * total_rollout_steps: 512*400
+  * require_batches: 4
+  * trigger_parameter_sync_step: 4
+  * staleness_threshold: 0.5
+  * partial_rollout: True
+
+| training mode | resource allocation | step (s) | gen (s) | old_log_prob (s) | update_actor (s) | validate time (s) | total time<br>50 steps | acc/mean@2 |
+|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
+| colocate sync | 16 | 484.623 | 52.939 | 0 | 430.263 | 205.080 | 7h9m | 22.6 |
+| fully_async_policy | 8:8 | 489.953 | 52.622 | 0 | 435.874 | 95.699 | 7h2m | 21.0 |
+
 ## Multi-Turn Tool Calling
 
 Referencing **recipe/retool** and **ToolAgentLoop**, we implemented **AsyncPartialToolAgentLoop**, a multi-turn
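As a quick arithmetic check of the "about 2x" validation-time claim, using only the validate-time column from the table above (a sketch; not part of the commit):

```python
# Values taken from the validate-time column above (seconds per validation).
colocate_sync_validate_s = 205.080
fully_async_validate_s = 95.699

speedup = colocate_sync_validate_s / fully_async_validate_s
print(f"validation speedup: {speedup:.2f}x")  # ~2.14x, i.e. "about 2x"
```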

recipe

Submodule recipe deleted from 21892b9

recipe/fully_async_policy/README.md

Lines changed: 597 additions & 0 deletions
Large diffs are not rendered by default.

recipe/fully_async_policy/README_zh.md

Lines changed: 517 additions & 0 deletions
Large diffs are not rendered by default.

verl/experimental/agent_loop/agent_loop.py

Lines changed: 1 addition & 1 deletion
@@ -911,7 +911,7 @@ def _init_agent_loop_workers(self):
             node_id = node_ids[i % len(node_ids)]
             self.agent_loop_workers.append(
                 self.agent_loop_workers_class.options(
-                    name=f"agent_loop_worker_{i}",
+                    name=f"agent_loop_worker_{i}" + f"_{uuid4().hex[:8]}",
                     scheduling_strategy=ray.util.scheduling_strategies.NodeAffinitySchedulingStrategy(
                         node_id=node_id, soft=True
                     ),
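The uuid suffix is presumably needed because Ray named actors must be unique within a namespace, so a fixed `agent_loop_worker_{i}` name would collide if two managers spawn workers with the same index (the motivation is my inference; the diff only shows the rename). A minimal standalone sketch of the pattern, with a hypothetical `Worker` actor class:

```python
# Sketch only: why a random suffix keeps Ray actor names unique.
# `Worker` is a hypothetical stand-in for verl's agent loop worker class.
from uuid import uuid4

import ray


@ray.remote
class Worker:
    def ping(self) -> str:
        return "pong"


ray.init()

# Two managers can now both create "worker 0" without a name collision.
for _ in range(2):
    name = f"agent_loop_worker_0_{uuid4().hex[:8]}"
    worker = Worker.options(name=name).remote()
    print(name, ray.get(worker.ping.remote()))
```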

verl/experimental/fully_async_policy/README.md

Lines changed: 40 additions & 0 deletions
@@ -105,6 +105,7 @@ but the overlap in their time consumption reduces the end-to-end time consumption
 | `async_training.checkpoint_engine.enable` | Whether to use checkpoint_engine for accelerating, default `True` |
 | `async_training.checkpoint_engine.overlap_broadcast_and_consume` | When use checkpoint_engine, whether to overlap broadcast and load_weights, default `False` |
 | `async_training.checkpoint_engine.device_buffer_size_M` | When use checkpoint_engine, the user-specific bucket size (MB), default `4096` |
+| `async_training.use_trainer_do_validate` | Whether to use the trainer node to run the validate process, default `False` |
 
 **Further Explanation:**
 
@@ -194,6 +195,13 @@ but the overlap in their time consumption reduces the end-to-end time consumption
 - When disable `overlap_broadcast_and_consume`, the additional device memory overhead of
   trainer rank is `2 * bucket_size` and rollout rank is `1 * bucket_size`
 
+- `async_training.use_trainer_do_validate`
+
+  Controls whether the trainer's `do_validate` method is used for validation.
+  If set to `True`, the trainer performs validation after each parameter update, which reduces the validation
+  time overhead and the trainer-node idle time.
+  If set to `False`, the trainer does not perform validation.
+
 ### Supported Modes
 
 1. on policy pipeline:
@@ -477,6 +485,38 @@ We tested the single-step parameter synchronization time of the checkpoint-engine
 | Qwen3-235B-A22B | 64 | 64 | False | 58.57s |
 | Qwen3-235B-A22B | 64 | 64 | True | 23.70s |
 
+
+### use_trainer_do_validate Experiment
+
+We tested the effect of setting `use_trainer_do_validate=True` on the training process. The results show that
+enabling it reduces the validation time overhead and the trainer-node idle time.
+We used Qwen2.5-Math-7B to verify the benefits: validation time improved by about 2x, and trainer-node idle
+time dropped by about 40%.
+
+- Machine: H20
+- Model: Qwen2.5-Math-7B
+- Rollout length: max_response_length (FSDP2): 10K tokens
+- Algorithm: DAPO
+- Dataset:
+  - TRAIN_FILE: dapo-math-17k.parquet
+  - TEST_FILE: aime-2024.parquet
+- Engine: vllm + FSDP2
+- rollout.n: 16
+- ppo_mini_batch_size: 32
+- test_freq: 10
+
+- fully_async_policy
+  - total_rollout_steps: 512*400
+  - require_batches: 4
+  - trigger_parameter_sync_step: 4
+  - staleness_threshold: 0.5
+  - partial_rollout: True
+
+| training mode | resource allocation | step (s) | gen (s) | old_log_prob (s) | update_actor (s) | validate time (s) | total time<br>50 steps | acc/mean@2 |
+|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
+| colocate sync | 16 | 484.623 | 52.939 | 0 | 430.263 | 205.080 | 7h9m | 22.6 |
+| fully_async_policy | 8:8 | 489.953 | 52.622 | 0 | 435.874 | 95.699 | 7h2m | 21.0 |
+| fully_async_policy_opt_validate | 8:8 | | | 0 | | | | |
+
 ## Multi-Turn Tool Calling
 
 Referencing **recipe/retool** and **ToolAgentLoop**, we implemented **AsyncPartialToolAgentLoop**, a multi-turn

verl/experimental/fully_async_policy/README_zh.md

Lines changed: 37 additions & 1 deletion
@@ -82,7 +82,7 @@ The overall architecture of fully_async_policy is shown in the figure below; fully_async_policy mainly consists of
 | `async_training.checkpoint_engine.enable` | Whether to enable checkpoint_engine acceleration, default `True` |
 | `async_training.checkpoint_engine.overlap_broadcast_and_consume` | When checkpoint_engine is enabled, whether to pipeline between broadcast and weight loading during parameter sync, default `False` |
 | `async_training.checkpoint_engine.device_buffer_size_M` | When checkpoint_engine is enabled, the size of the assembled buckets (MB), default `4096` |
-
+| `async_training.use_trainer_do_validate` | Whether to use the trainer's do_validate method for validation, default `False` |
 **Further Explanation:**
 
 - `rollout.total_rollout_steps`
@@ -155,6 +155,12 @@ require_batches * ppo_mini_batch_size)`.
 - When `overlap_broadcast_and_consume` is enabled, the temporary extra device memory overhead is `3 * bucket_size` on trainer nodes and `2 * bucket_size` on rollout nodes
 - When `overlap_broadcast_and_consume` is disabled, the temporary extra device memory overhead is `2 * bucket_size` on trainer nodes and `1 * bucket_size` on rollout nodes
 
+- `async_training.use_trainer_do_validate`
+
+  Controls whether the trainer's `do_validate` method is used for validation.
+  If set to True, the trainer calls `do_validate` after each parameter update.
+  If set to False, the trainer does not call `do_validate`.
+
 ### Supported Modes
 
 1. on policy pipeline:
@@ -407,6 +413,36 @@ divisible by the GPU count, which limits the flexibility of resource adjustment; moreover, as
 | Qwen3-235B-A22B | 64 | 64 | False | 58.57s |
 | Qwen3-235B-A22B | 64 | 64 | True | 23.70s |
 
+### use_trainer_do_validate Experiment
+
+We tested the effect of the `use_trainer_do_validate` option on the Qwen2.5-Math-7B model. The results show that `use_trainer_do_validate=True` reduces the validation time overhead and also reduces the trainer-node idle time.
+
+- Machine: H20
+- Model: Qwen2.5-Math-7B
+- Rollout length: max_response_length (FSDP2): 10K tokens
+- Algorithm: DAPO
+- Dataset:
+  - TRAIN_FILE: dapo-math-17k.parquet
+  - TEST_FILE: aime-2024.parquet
+- Engine: vllm + FSDP2
+- rollout.n: 16
+- ppo_mini_batch_size: 32
+- test_freq: 10
+
+- fully_async_policy
+  - total_rollout_steps: 512*400
+  - require_batches: 4
+  - trigger_parameter_sync_step: 4
+  - staleness_threshold: 0.5
+  - partial_rollout: True
+
+| training mode | resource allocation | step (s) | gen (s) | old_log_prob (s) | update_actor (s) | validate time (s) | total time<br>50 steps | acc/mean@2 |
+|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
+| colocate sync | 16 | 484.623 | 52.939 | 0 | 430.263 | 205.080 | 7h9m | 22.6 |
+| fully_async_policy | 8:8 | 489.953 | 52.622 | 0 | 435.874 | 95.699 | 7h2m | 21.0 |
+| fully_async_policy_opt_validate | 8:8 | | | 0 | | | | |
+
 ## Multi-Turn Tool Calling
 
 Referencing **recipe/retool** and **ToolAgentLoop**, we implemented a multi-turn tool-calling loop with partial-rollout support for **fully_async_policy**

verl/experimental/fully_async_policy/config/fully_async_ppo_megatron_trainer.yaml

Lines changed: 4 additions & 0 deletions
@@ -27,6 +27,10 @@ async_training:
   # compute_prox_log_prob
   compute_prox_log_prob: False
 
+  # whether to use trainer do_validate
+  use_trainer_do_validate: False
+
+
   # checkpoint_engine config for accelerating parameter synchronization between rollouter and trainer
   checkpoint_engine:
     # Whether to use checkpoint_engine

verl/experimental/fully_async_policy/config/fully_async_ppo_trainer.yaml

Lines changed: 4 additions & 0 deletions
@@ -27,6 +27,10 @@ async_training:
   # compute_prox_log_prob
   compute_prox_log_prob: False
 
+  # whether to use trainer do_validate
+  use_trainer_do_validate: False
+
+
   # checkpoint_engine config for accelerating parameter synchronization between rollouter and trainer
   checkpoint_engine:
     # Whether to use checkpoint_engine
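Both trainer YAMLs ship the same `False` default. A small sketch of how the dotlist override used throughout this PR (`async_training.use_trainer_do_validate=True`) flips that default, assuming OmegaConf-style configs as verl configs are typically consumed:

```python
# Sketch: a command-line dotlist override flipping the YAML default.
from omegaconf import OmegaConf

base = OmegaConf.create({"async_training": {"use_trainer_do_validate": False}})
override = OmegaConf.from_dotlist(["async_training.use_trainer_do_validate=True"])
cfg = OmegaConf.merge(base, override)

assert cfg.async_training.use_trainer_do_validate is True
```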

verl/experimental/fully_async_policy/fsdp_workers.py

Lines changed: 3 additions & 3 deletions
@@ -89,7 +89,7 @@ def sync_rollout_weights(self, sync_group_name="actor_rollout"):
         if self._is_actor and self._is_offload_param:
             load_fsdp_model_to_gpu(self.actor_module_fsdp)
         params = self._get_actor_params() if self._is_actor else None
-        if self._is_rollout:
+        if self._is_rollout and (not self._is_actor):
             inference_model = get_inference_model(self.rollout)
 
             from verl.utils.vllm.patch import patch_vllm_moe_model_weight_loader
@@ -107,7 +107,7 @@ def sync_rollout_weights(self, sync_group_name="actor_rollout"):
             from ray.util.collective import collective
 
             collective.broadcast(tensor, src_rank=0, group_name=sync_group_name)
-            if self._is_rollout:
+            if self._is_rollout and (not self._is_actor):
                 inference_model.load_weights([(key, tensor)])
 
         if self._is_actor and self._is_offload_param:
@@ -159,7 +159,7 @@ def sync_rollout_weights_by_checkpoint(self, sync_group_name="actor_rollout"):
         update_start_time = time.time()
 
         inference_model = None
-        if self._is_rollout:
+        if self._is_rollout and (not self._is_actor):
             inference_model = get_inference_model(self.rollout)
             from verl.utils.vllm.patch import patch_vllm_moe_model_weight_loader
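All three hunks tighten the same guard: a worker that is both actor and rollout no longer loads the broadcast weights into its inference engine. My reading (an inference; the diff does not state the rationale) is that such hybrid workers appear when the trainer node also runs validation, and they already hold the freshly updated actor parameters, so consuming the broadcast would be redundant. A toy restatement of the predicate:

```python
# Toy restatement of the guard introduced in fsdp_workers.py; `is_actor` and
# `is_rollout` stand in for the worker's self._is_actor / self._is_rollout flags.
def should_consume_broadcast(is_actor: bool, is_rollout: bool) -> bool:
    """Only pure rollout workers load broadcast weights into the inference
    engine; hybrid actor+rollout workers already hold the updated params."""
    return is_rollout and not is_actor


assert should_consume_broadcast(is_actor=False, is_rollout=True)      # pure rollout
assert not should_consume_broadcast(is_actor=True, is_rollout=True)   # hybrid: skip
assert not should_consume_broadcast(is_actor=True, is_rollout=False)  # pure actor: skip
```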
