one step off policy: training efficiency of dapo_7b_math_fsdp2 does not improve, unable to reproduce

I ran into some confusing behavior while using the one_step_off_policy recipe in verl 0.5.0 and would like to ask for your help.

I am using the script you provide:
https://github.com/volcengine/verl/blob/main/recipe/one_step_off_policy/dapo_7b_math_fsdp2_4_12.sh

However, in actual training I could not reproduce the results you show on W&B:
https://wandb.ai/hou-zg-meituan/one-step-off-policy/workspace?nw=nwuserhouzg

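My questions are as follows:

1. I trained with both dapo_7b_math_fsdp2_4_12.sh and dapo_7b_math_fsdp2_colocate.sh, but timing_s/step is essentially the same for the two runs. In other words, one step off policy shows no gain in training efficiency.

2. You have only published the source data for dapo_7b_math_megatron. Could you also share the training curves for dapo_7b_math_fsdp2?

3. In terms of resource allocation, colocate has more GPUs available, so in theory timing_s/old_log_prob should be lower. My results match this expectation, but in the curves you published, timing_s/old_log_prob is higher for colocate, which I find confusing.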
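
A minimal sketch of how the comparison in question 1 could be quantified, assuming the timing_s/step values of each run have been exported as plain lists of seconds. The function names, the warmup cutoff, and the sample numbers are illustrative placeholders, not measurements from my runs and not part of verl.

```python
from statistics import mean


def mean_step_time(step_times, warmup=2):
    """Average seconds per step, ignoring the first `warmup` steps."""
    steady = step_times[warmup:]
    if not steady:
        raise ValueError("need more recorded steps than warmup steps")
    return mean(steady)


def speedup(baseline_times, candidate_times, warmup=2):
    """Ratio of baseline to candidate mean step time (>1 means candidate is faster)."""
    return mean_step_time(baseline_times, warmup) / mean_step_time(candidate_times, warmup)


# Placeholder values only: hypothetical timing_s/step series for the two scripts.
colocate_times = [120.0, 100.0, 100.0, 100.0]
one_step_off_times = [110.0, 90.0, 80.0, 80.0, 80.0]

ratio = speedup(colocate_times, one_step_off_times)
print(f"speedup of one step off policy over colocate: {ratio:.2f}x")
```

A ratio close to 1.0 after discarding warmup steps is what I mean by "no difference" in question 1.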

Here are my training results:

<img width="1272" height="727" alt="Image" src="https://github.com/user-attachments/assets/1794cfc6-c1e8-43bd-b98b-b10773263701" />

<img width="1264" height="695" alt="Image" src="https://github.com/user-attachments/assets/20040928-61b3-4cfd-886b-2e43500db38c" />

I would appreciate it if you could answer these questions. Thank you!
