
# Commit 3df21a6 (parent: 69c58ee)

## [BREAKING][reward] refactor: deprecate batch reward manager (verl-project#5237)
### What does this PR do?

This PR makes the following refactoring effort:

1. Unify the reward managers (i.e., `config.reward_manager` and `config.reward_model.reward_manager`) into one, accepting both the `register` reward manager and the user-customized module.
2. Remove the legacy implementation and modify the relevant scripts and docs.

```yaml
reward_manager:
  _target_: verl.workers.config.reward_model.RewardManagerConfig
  source: register
  name: naive
  module:
    _target_: verl.trainer.config.config.ModuleConfig
    path: null
    name: custom_reward_manager
```

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: ...
- [ ] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes, if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [ ] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [ ] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
- [ ] If your PR is related to the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.
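For orientation, here is how the unified schema above surfaces as Hydra CLI overrides. The registered-manager form is exactly what the scripts in this commit use; the custom-module form is a sketch only, with the `source` value and the `module.*` keys inferred from the `RewardManagerConfig` YAML above, and the file path and symbol name purely hypothetical.

```sh
# Registered reward manager -- the form used by every script in this commit:
python3 -m verl.trainer.main_ppo \
    reward_model.reward_manager.name=dapo  # ...remaining overrides elided

# User-customized module -- a sketch only: the "module" source value and the
# module.path / module.name keys are inferred from the YAML schema above, and
# the path and symbol name are hypothetical:
python3 -m verl.trainer.main_ppo \
    reward_model.reward_manager.source=module \
    reward_model.reward_manager.module.path=my_rewards/manager.py \
    reward_model.reward_manager.module.name=custom_reward_manager  # ...elided
```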

## File tree

69 files changed: +214 additions, −610 deletions


### docs/advance/reward_loop.rst

Lines changed: 1 addition & 1 deletion

```diff
@@ -204,7 +204,7 @@ See ``verl/experimental/reward_manager/*`` for reference.
     # your own reward manager
     ...

-After defining it, users can specify their custom reward manager by setting ``reward_model.reward_manager=user_costomized``.
+After defining it, users can specify their custom reward manager by setting ``reward_model.reward_manager.name=user_costomized``.

 RewardLoopManager
 ~~~~~~~~~~~~~~~~~
```
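Rendered as a CLI override, the doc's instruction might look like the sketch below; it assumes the custom manager is registered under the doc's name (spelling `user_costomized` preserved from the source) and that `source` keeps its default of `register`.

```sh
# Sketch: select a manager registered as "user_costomized" (spelling as in
# the doc); assumes reward_manager.source defaults to "register".
python3 -m verl.trainer.main_ppo \
    reward_model.reward_manager.name=user_costomized  # ...remaining overrides elided
```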

### docs/ascend_tutorial/examples/dapo_multi_model_optimization_practice.md

Lines changed: 2 additions & 2 deletions

````diff
@@ -15,7 +15,7 @@ For the DAPO paper, see [DAPO](https://arxiv.org/pdf/2503.14476), which includes…
 In the DAPO algorithm, this must be configured as dapo.

 ```
-reward_model.reward_manager=dapo
+reward_model.reward_manager.name=dapo
 ```

 - **Clip-Higher (higher clipping)**
@@ -250,7 +250,7 @@ ray job submit --no-wait --runtime-env="${RUNTIME_ENV}" \
     actor_rollout_ref.ref.fsdp_config.param_offload=True \
     actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
     actor_rollout_ref.actor.fsdp_config.fsdp_size=${fsdp_size} \
-    reward_model.reward_manager=dapo \
+    reward_model.reward_manager.name=dapo \
     reward_model.overlong_buffer.enable=${enable_overlong_buffer} \
     reward_model.overlong_buffer.len=${overlong_buffer_len} \
     reward_model.overlong_buffer.penalty_factor=${overlong_penalty_factor} \
````
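Note that this commit leaves two spellings of the overlong-buffer options in place: the Ascend recipe above keeps `reward_model.overlong_buffer.*`, while the example scripts below append them under `reward_model.reward_kwargs`. A sketch of the scripts' form follows; the three values are illustrative assumptions, not taken from the commit.

```sh
enable_overlong_buffer=True
overlong_buffer_len=4096       # illustrative value, not from the commit
overlong_penalty_factor=1.0    # illustrative value, not from the commit

python3 -m verl.trainer.main_ppo \
    reward_model.reward_manager.name=dapo \
    +reward_model.reward_kwargs.overlong_buffer_cfg.enable=${enable_overlong_buffer} \
    +reward_model.reward_kwargs.overlong_buffer_cfg.len=${overlong_buffer_len} \
    +reward_model.reward_kwargs.overlong_buffer_cfg.penalty_factor=${overlong_penalty_factor}
```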

### examples/gmpo_trainer/test_dapo_7b_math.sh

Lines changed: 1 addition & 1 deletion

```diff
@@ -117,7 +117,7 @@ python3 -m verl.trainer.main_ppo \
     actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
     actor_rollout_ref.actor.fsdp_config.fsdp_size=${fsdp_size} \
     actor_rollout_ref.actor.checkpoint.save_contents="${save_contents}" \
-    reward_model.reward_manager=dapo \
+    reward_model.reward_manager.name=dapo \
     +reward_model.reward_kwargs.overlong_buffer_cfg.enable=${enable_overlong_buffer} \
     +reward_model.reward_kwargs.overlong_buffer_cfg.len=${overlong_buffer_len} \
     +reward_model.reward_kwargs.overlong_buffer_cfg.penalty_factor=${overlong_penalty_factor} \
```

### examples/gmpo_trainer/test_dapo_qwen3_30b_math.sh

Lines changed: 1 addition & 1 deletion

```diff
@@ -113,7 +113,7 @@ python3 -m verl.trainer.main_ppo \
     actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
     actor_rollout_ref.actor.fsdp_config.fsdp_size=${fsdp_size} \
     actor_rollout_ref.actor.checkpoint.save_contents="${save_contents}" \
-    reward_model.reward_manager=dapo \
+    reward_model.reward_manager.name=dapo \
     +reward_model.reward_kwargs.overlong_buffer_cfg.enable=${enable_overlong_buffer} \
     +reward_model.reward_kwargs.overlong_buffer_cfg.len=${overlong_buffer_len} \
     +reward_model.reward_kwargs.overlong_buffer_cfg.penalty_factor=${overlong_penalty_factor} \
```

### examples/grpo_trainer/run_deepseek671b_math_megatron_96gb.sh

Lines changed: 1 addition & 1 deletion

```diff
@@ -159,7 +159,7 @@ python3 -m verl.trainer.main_ppo \
     +actor_rollout_ref.actor.megatron.override_transformer_config.account_for_embedding_in_pipeline_split=False \
     +actor_rollout_ref.actor.megatron.override_transformer_config.account_for_loss_in_pipeline_split=False \
     +actor_rollout_ref.actor.megatron.override_transformer_config.num_layers_in_last_pipeline_stage=${LAST_LAYER} \
-    reward_model.reward_manager=dapo \
+    reward_model.reward_manager.name=dapo \
     +reward_model.reward_kwargs.overlong_buffer_cfg.enable=${enable_overlong_buffer} \
     +reward_model.reward_kwargs.overlong_buffer_cfg.len=${overlong_buffer_len} \
     +reward_model.reward_kwargs.overlong_buffer_cfg.penalty_factor=${overlong_penalty_factor} \
```

### examples/grpo_trainer/run_qwen3-235b_megatron_96gb.sh

Lines changed: 1 addition & 1 deletion

```diff
@@ -161,7 +161,7 @@ python3 -m verl.trainer.main_ppo \
     +actor_rollout_ref.actor.megatron.override_transformer_config.moe_enable_deepep=True \
     +actor_rollout_ref.actor.megatron.override_transformer_config.account_for_loss_in_pipeline_split=True \
     +actor_rollout_ref.actor.megatron.override_transformer_config.account_for_embedding_in_pipeline_split=True \
-    reward_model.reward_manager=dapo \
+    reward_model.reward_manager.name=dapo \
     +reward_model.reward_kwargs.overlong_buffer_cfg.enable=${enable_overlong_buffer} \
     +reward_model.reward_kwargs.overlong_buffer_cfg.len=${overlong_buffer_len} \
     +reward_model.reward_kwargs.overlong_buffer_cfg.penalty_factor=${overlong_penalty_factor} \
```

### examples/grpo_trainer/run_qwen3moe-30b_megatron_96gb.sh

Lines changed: 1 addition & 1 deletion

```diff
@@ -176,7 +176,7 @@ python3 -m verl.trainer.main_ppo --config-path=./config --config-name='ppo_megat
     actor_rollout_ref.ref.megatron.context_parallel_size=${REF_CP} \
     actor_rollout_ref.ref.megatron.expert_model_parallel_size=${REF_EP} \
     actor_rollout_ref.ref.megatron.expert_tensor_parallel_size=${REF_ETP} \
-    reward_model.reward_manager=dapo \
+    reward_model.reward_manager.name=dapo \
     +reward_model.reward_kwargs.overlong_buffer_cfg.enable=${enable_overlong_buffer} \
     +reward_model.reward_kwargs.overlong_buffer_cfg.len=${overlong_buffer_len} \
     +reward_model.reward_kwargs.overlong_buffer_cfg.penalty_factor=${overlong_penalty_factor} \
```

### examples/gspo_trainer/run_qwen30b_gspo.sh

Lines changed: 1 addition & 1 deletion

```diff
@@ -153,7 +153,7 @@ ROLLOUT_CONFIG="

 # ===================================== Reward =====================================
 REWARD_CONFIG="
-reward_model.reward_manager=dapo \
+reward_model.reward_manager.name=dapo \
 +reward_model.reward_kwargs.overlong_buffer_cfg.enable=${enable_overlong_buffer} \
 +reward_model.reward_kwargs.overlong_buffer_cfg.len=${overlong_buffer_len} \
 +reward_model.reward_kwargs.overlong_buffer_cfg.penalty_factor=${overlong_penalty_factor} \
```

### examples/gspo_trainer/test_gspo_3b_math.sh

Lines changed: 1 addition & 1 deletion

```diff
@@ -173,7 +173,7 @@ python3 -m verl.trainer.main_ppo \
     actor_rollout_ref.ref.fsdp_config.param_offload=${offload} \
     actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
     actor_rollout_ref.actor.entropy_checkpointing=${entropy_checkpointing} \
-    reward_model.reward_manager=${reward_manager} \
+    reward_model.reward_manager.name=${reward_manager} \
     +reward_model.reward_kwargs.overlong_buffer_cfg.enable=${enable_overlong_buffer} \
     +reward_model.reward_kwargs.overlong_buffer_cfg.len=${overlong_buffer_len} \
     +reward_model.reward_kwargs.overlong_buffer_cfg.penalty_factor=${overlong_penalty_factor} \
```

### examples/gspo_trainer/test_gspo_3b_math_slurm.sh

Lines changed: 1 addition & 1 deletion

```diff
@@ -177,7 +177,7 @@ python3 -m verl.trainer.main_ppo \
     actor_rollout_ref.ref.fsdp_config.param_offload=${offload} \
     actor_rollout_ref.ref.ulysses_sequence_parallel_size=${sp_size} \
     actor_rollout_ref.actor.entropy_checkpointing=${entropy_checkpointing} \
-    reward_model.reward_manager=${reward_manager} \
+    reward_model.reward_manager.name=${reward_manager} \
     +reward_model.reward_kwargs.overlong_buffer_cfg.enable=${enable_overlong_buffer} \
     +reward_model.reward_kwargs.overlong_buffer_cfg.len=${overlong_buffer_len} \
     +reward_model.reward_kwargs.overlong_buffer_cfg.penalty_factor=${overlong_penalty_factor} \
```
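Taken together, every script and doc change in this commit is the same one-line migration; only the override key changes, never the value:

```sh
# Before (deprecated scalar override):
#   reward_model.reward_manager=dapo \
# After (name field of the unified RewardManagerConfig):
#   reward_model.reward_manager.name=dapo \
```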
