
Introduce minimal generation backend interface for GRPO and RLOO trainers#5244

Open
rycerzes wants to merge 5 commits into huggingface:main from rycerzes:fix-5193-minimal-interface

Conversation

Contributor

@rycerzes rycerzes commented Mar 8, 2026

Summary

Closes #5193 (step 2 of #5119): introduces a GenerationBackend protocol in trl/generation/backend.py and refactors GRPOTrainer and RLOOTrainer to dispatch through it, eliminating inline backend if/elif/else branches.
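As a rough sketch of what such an interface could look like, assuming a `typing.Protocol` with a single `generate` method (all names and fields here are illustrative guesses, not the PR's actual code):

```python
from dataclasses import dataclass, field
from typing import Protocol, runtime_checkable


@dataclass
class GenerationResult:
    # Illustrative fields; the PR's actual dataclass may carry more state.
    prompt_ids: list = field(default_factory=list)
    completion_ids: list = field(default_factory=list)


@runtime_checkable
class GenerationBackend(Protocol):
    # The minimal surface a backend would expose to the trainers (assumed).
    def generate(self, prompts: list) -> GenerationResult: ...


class EchoBackend:
    # Any class with a matching generate() satisfies the protocol
    # structurally; no inheritance from GenerationBackend is required.
    def generate(self, prompts: list) -> GenerationResult:
        return GenerationResult(completion_ids=list(prompts))
```

The trainer would then call `self.generation_backend.generate(...)` without knowing which concrete adapter it holds.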

Changes

  • trl/generation/backend.py (new): GenerationBackend protocol, GenerationResult dataclass, VLLMBackendAdapter, TransformersPagedBackendAdapter, TransformersBackendAdapter, and create_generation_backend() factory
  • GRPOTrainer / RLOOTrainer: wired to self.generation_backend at init; _generate_single_turn reduced to orchestration, with all backend-specific logic moved into the adapters
  • Tests: new tests/test_generation_backend.py; additional cases in test_grpo_trainer.py and test_rloo_trainer.py

Preserved

All existing invariants: _last_loaded_step sync semantics, return contracts, rollout_func dispatch, and public API are unchanged.

Unblocks #5194, #5195.
CC: @albertvillanova




@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.



```python
# Use the base Trainer input preparation path, not trainer-specific overrides
# like GRPO/RLOO _prepare_inputs, to avoid recursive generation.
base_prepare_inputs = super(type(trainer), trainer)._prepare_inputs
```


super(type(trainer), trainer) breaks for trainer subclasses

Medium Severity

super(type(trainer), trainer)._prepare_inputs is a well-known Python anti-pattern that breaks under inheritance. If a user subclasses GRPOTrainer or RLOOTrainer, type(trainer) returns the subclass, so super() resolves to the trainer's own _prepare_inputs (which triggers _generate_and_score_completions) instead of the base Trainer._prepare_inputs. This causes infinite recursion — the exact scenario the comment on line 287-288 warns about. The old code used super()._prepare_inputs inside the trainer method, which hardcodes the class to skip regardless of subclassing.
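The failure mode can be reproduced with a toy hierarchy (illustrative names: `Base` stands in for `Trainer`, `Middle` for `GRPOTrainer`/`RLOOTrainer`, `Sub` for a user subclass):

```python
class Base:                      # stands in for transformers.Trainer
    def _prepare_inputs(self):
        return "base"


class Middle(Base):              # stands in for GRPOTrainer/RLOOTrainer
    def _prepare_inputs(self):
        return "middle"          # in TRL, this re-enters generation


class Sub(Middle):               # a user's trainer subclass
    pass


trainer = Sub()

# Anti-pattern: type(trainer) is Sub, so the MRO walk starts after Sub
# and lands on Middle._prepare_inputs -- exactly the override the helper
# was trying to skip (infinite recursion in the real code).
broken = super(type(trainer), trainer)._prepare_inputs()

# Fix: name the class to skip explicitly; subclassing no longer matters.
fixed = super(Middle, trainer)._prepare_inputs()

print(broken, fixed)  # middle base
```

Equivalently, `Base._prepare_inputs(trainer)` can be called directly when the helper lives outside the class.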


Member

@albertvillanova albertvillanova Mar 10, 2026


I think this report by Cursor is a real issue: #5244 (comment)

Member

@albertvillanova albertvillanova left a comment


Thanks a lot for addressing the decoupling of inference backend from rollout & agent logic.

First of all, a quick note on ongoing changes: note that both GRPO/RLOO .generate_single_turn and .generate are currently undergoing a significant refactoring to fix a bug related to re-tokenization in GRPO multi-turn tool calling. For example, rollout_func handling was moved up to .generate. See:

Because of this, there may be some overlap or potential conflicts with the current changes, so it would be good to take that into account when shaping the final design.

Second, regarding the proposed abstraction: my understanding is that the project's architecture guidelines generally try to avoid introducing additional abstraction layers unless they clearly simplify the system or enable new capabilities.

  • I'm tagging @qgallouedec, as he has a stronger opinion on this principle and may want to weigh in.

Given that, what about starting with a simpler approach in this PR? Maybe the backend branches could initially just be extracted as plain functions with explicit parameters, rather than wrapped in stateful adapter classes behind a factory. This would keep the design simpler and make it easier to iterate before committing to a more structured abstraction layer, if one proves necessary.

What do you think? On the other hand, maybe you already have a clearer view of the required level of abstraction.


```python
# Use the base Trainer input preparation path, not trainer-specific overrides
# like GRPO/RLOO _prepare_inputs, to avoid recursive generation.
base_prepare_inputs = super(type(trainer), trainer)._prepare_inputs
```
Member

@albertvillanova albertvillanova Mar 10, 2026


I think this report by Cursor is a real issue: #5244 (comment)

@qgallouedec
Member

qgallouedec commented Mar 10, 2026

> Second, regarding the proposed abstraction: my understanding is that the project's architecture guidelines generally try to avoid introducing additional abstraction layers unless they clearly simplify the system or enable new capabilities.

Yes, here it feels like we're creating a new abstraction in case we need additional backends, which is neither planned nor discussed at the moment. We may need such a thing in the future, but let's try not to solve issues that don't exist yet.

@rycerzes
Contributor Author

Thanks for the feedback @albertvillanova @qgallouedec, it makes the direction clear.

The Protocol + stateful adapter + factory adds a layer that isn't justified by the current set of backends. Plain functions with explicit parameters would be the right shape here. My main goal was to let openenv/utils.py call into generation without reintroducing the if/elif flag checks, but I may have over-abstracted to get there.
And the #5224 series is reshaping the code this PR touches.

My plan is to wait for the #5242 chain to fully settle on main, then rebase and rework this PR and #5256 as plain function extractions: dispatch bodies in module-level functions with explicit parameters, the if/elif staying in the trainer, and no adapter classes. The openenv/utils.py coupling gets resolved by passing vllm_generation directly, not a method on an adapter. Will open a revised draft once #5242 lands.
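The decoupling described here could be sketched as follows (hypothetical names: `rollout` and `generate_fn` are illustrative stand-ins for the openenv/utils.py call site and the passed-in generation function, not TRL's actual API):

```python
def rollout(prompts, generate_fn):
    # The utility receives the generation callable directly, so it
    # never inspects backend flags (no if use_vllm / elif / else here;
    # the caller decides which backend function to hand over).
    return [generate_fn(p) for p in prompts]


# Usage: the trainer passes whatever backend function it dispatches to,
# e.g. a bound vllm_generation function in the real code. A trivial
# stand-in callable is used here.
completions = rollout(["hello", "world"], str.upper)
```

The flag checks collapse into a single function argument at the call boundary.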




Development

Successfully merging this pull request may close these issues.

Introduce a Minimal Backend Generation Interface
