Introduce minimal generation backend interface for GRPO and RLOO trainers#5244
rycerzes wants to merge 5 commits into huggingface:main
Conversation
- …neration and sync weights
- prevent recursive generation
Cursor Bugbot has reviewed your changes and found 1 potential issue.
```python
# Use the base Trainer input preparation path, not trainer-specific overrides
# like GRPO/RLOO _prepare_inputs, to avoid recursive generation.
base_prepare_inputs = super(type(trainer), trainer)._prepare_inputs
```
`super(type(trainer), trainer)` breaks for trainer subclasses
Medium Severity
`super(type(trainer), trainer)._prepare_inputs` is a well-known Python anti-pattern that breaks under inheritance. If a user subclasses `GRPOTrainer` or `RLOOTrainer`, `type(trainer)` returns the subclass, so `super()` resolves to the trainer's own `_prepare_inputs` (which triggers `_generate_and_score_completions`) instead of the base `Trainer._prepare_inputs`. This causes infinite recursion, the exact scenario the comment on lines 287-288 warns about. The old code used the zero-argument `super()._prepare_inputs` inside the trainer method, which hardcodes the class to skip regardless of subclassing.
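To illustrate the failure mode, here is a minimal sketch with hypothetical stand-in classes (not the actual TRL code):

```python
class Trainer:
    def _prepare_inputs(self, inputs):
        return ("base", inputs)

class GRPOLikeTrainer(Trainer):
    # hypothetical stand-in for GRPOTrainer, for illustration only
    def _prepare_inputs(self, inputs):
        # trainer-specific override that triggers generation
        return self._generate(inputs)

    def _generate(self, inputs):
        # the anti-pattern: resolve "the base class" from the runtime type
        base_prepare_inputs = super(type(self), self)._prepare_inputs
        return base_prepare_inputs(inputs)

class SubclassedTrainer(GRPOLikeTrainer):
    pass

# Works for the exact class: super(GRPOLikeTrainer, ...) skips to Trainer.
print(GRPOLikeTrainer()._prepare_inputs("x"))  # ('base', 'x')

# For a subclass, type(self) is SubclassedTrainer, so super() resolves to
# GRPOLikeTrainer._prepare_inputs, the method we just came from.
try:
    SubclassedTrainer()._prepare_inputs("x")
except RecursionError:
    print("infinite recursion")
```

Hardcoding the class to skip, e.g. `super(GRPOLikeTrainer, trainer)._prepare_inputs` or `Trainer._prepare_inputs.__get__(trainer)`, avoids this, which is what the zero-argument `super()` inside the trainer method did implicitly.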
I think this report by Cursor is a real issue: #5244 (comment)
albertvillanova left a comment
Thanks a lot for addressing the decoupling of inference backend from rollout & agent logic.
First of all, a quick note on ongoing changes: both the GRPO/RLOO `.generate_single_turn` and `.generate` methods are currently undergoing a significant refactoring to fix a bug related to re-tokenization in GRPO multi-turn tool calling. For example, `rollout_func` handling was moved up to `.generate`. See:
Because of this, there may be some overlap or potential conflicts with the current changes, so it would be good to take that into account when shaping the final design.
Second, regarding the proposed abstraction: my understanding is that the project's architecture guidelines generally try to avoid introducing additional abstraction layers unless they clearly simplify the system or enable new capabilities.
- I tag @qgallouedec, as he has a stronger opinion on this principle and may want to weigh in
Given that, what about a starting PR with a simpler approach? Maybe the backend branches could initially just be extracted as plain functions with explicit parameters, rather than wrapped in stateful adapter classes behind a factory. This would keep the design simpler and make it easier to iterate before committing to a more structured abstraction layer, if one proves necessary.
What do you think? On the other hand, maybe you already have a clearer view of the required level of abstraction.
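As a rough illustration of that simpler shape (all names and signatures below are hypothetical, not the PR's actual API):

```python
def generate_with_vllm(prompts, client, **sampling_kwargs):
    # backend-specific body extracted into a plain module-level function
    # with explicit parameters instead of adapter state
    return [client.generate(p, **sampling_kwargs) for p in prompts]

def generate_with_transformers(prompts, model, **sampling_kwargs):
    return [model.generate(p, **sampling_kwargs) for p in prompts]

def generate_single_turn(backend, prompts, *, client=None, model=None):
    # the if/elif dispatch stays in the trainer; no factory, no
    # stateful adapter objects
    if backend == "vllm":
        return generate_with_vllm(prompts, client)
    elif backend == "transformers":
        return generate_with_transformers(prompts, model)
    raise ValueError(f"unknown generation backend: {backend!r}")
```

The trade-off is that explicit parameters make each backend's dependencies visible at the call site, at the cost of longer signatures if the dependency list grows.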
I think this report by Cursor is a real issue: #5244 (comment)
yes, here it feels like we're creating a new abstraction in case we need additional backends, which is neither planned nor discussed at the moment. We may need such a thing in the future, but let's try not to solve issues that don't exist yet.
Thanks for the feedback @albertvillanova @qgallouedec, that makes the direction clear. My plan is to wait for the #5242 chain to fully settle on main, then rebase and rework this PR and #5256 as plain function extractions: dispatch bodies become module-level functions with explicit parameters, the if/elif stays in the trainer, and there are no adapter classes. The openenv/utils.py coupling gets resolved by passing vllm_generation directly rather than through a method on an adapter. Will open a revised draft and update here once the #5242 chain lands.


Summary
Closes #5193 (step 2 of #5119): introduces a `GenerationBackend` protocol in `trl/generation/backend.py` and refactors `GRPOTrainer` and `RLOOTrainer` to dispatch through it, eliminating inline backend `if/elif/else` branches.

Changes
- `trl/generation/backend.py` (new): `GenerationBackend` protocol, `GenerationResult` dataclass, `VLLMBackendAdapter`, `TransformersPagedBackendAdapter`, `TransformersBackendAdapter`, and a `create_generation_backend()` factory
- `GRPOTrainer`/`RLOOTrainer`: wired to `self.generation_backend` at init; `_generate_single_turn` reduced to orchestration, with all backend-specific logic moved to the adapters
- Tests: `tests/test_generation_backend.py`; additional cases in `test_grpo_trainer.py` and `test_rloo_trainer.py`

Preserved
All existing invariants: `_last_loaded_step` sync semantics, return contracts, `rollout_func` dispatch, and the public API are unchanged.

Unblocks #5194, #5195.
CC: @albertvillanova