
async streaming grpo w prefetch#5250

Open
winglian wants to merge 3 commits into huggingface:main from winglian:async-streaming-grpo

Conversation

Contributor

@winglian commented Mar 9, 2026

What does this PR do?

Companion PR to huggingface/transformers#44259.

We may need to require transformers > v5.3.0, depending on when the companion transformers PR is merged.

[Screenshot: 2026-03-09 at 12:56:41 PM]

Alongside #5249, this yields significant speedups.


Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.


Note

High Risk
Refactors GRPO’s core online rollout/training flow to use transformers DataProducer with background-thread prefetch and optional streaming scoring, which is concurrency- and training-correctness sensitive (staleness, synchronization, metrics/state updates). Changes also affect vLLM weight syncing and importance-sampling application paths.

Overview
Adds an optional transformers DataProducer-based rollout pipeline for GRPOTrainer, enabling async rollout prefetching and an optional streaming partial-batch mode that incrementally scores prompt groups and yields micro-batches to overlap reward computation with policy logprob scoring.
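The streaming partial-batch mode described above can be sketched as a generator that accumulates scored prompt groups and yields a micro-batch as soon as a minimum number of groups is available, rather than waiting for the full rollout batch. This is a minimal illustration, not the PR's implementation: the function name `stream_micro_batches` and the group/sample shapes are assumptions; only the `streaming_min_groups` parameter name comes from the PR description.

```python
from typing import Iterable, Iterator, List


def stream_micro_batches(
    scored_groups: Iterable[List[dict]],
    streaming_min_groups: int = 2,
) -> Iterator[List[dict]]:
    """Yield micro-batches of scored prompt groups as soon as enough
    have accumulated, overlapping reward computation with downstream
    policy-logprob scoring (sketch, not the actual TRL code)."""
    pending: List[List[dict]] = []
    for group in scored_groups:
        pending.append(group)
        if len(pending) >= streaming_min_groups:
            # Flatten the buffered groups into one micro-batch.
            yield [sample for g in pending for sample in g]
            pending = []
    if pending:
        # Flush whatever remains as a final partial chunk.
        yield [sample for g in pending for sample in g]


# Usage: 5 groups of 4 samples each arrive one at a time.
groups = [[{"prompt": i, "reward": 0.0} for _ in range(4)] for i in range(5)]
batches = list(stream_micro_batches(groups, streaming_min_groups=2))
```

With 5 groups and `streaming_min_groups=2`, this yields two full micro-batches of 8 samples plus a final partial chunk of 4.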

Introduces GRPODataProducer/RolloutDataset and updates GRPOTrainer to create/wire the producer, handle deferred scoring when rollouts are produced on a background thread (including vLLM sync scheduling via vllm_sync_interval), and bypass the legacy _prepare_inputs buffering path when the producer is enabled. Adds new config flags (use_data_producer, async_prefetch, prefetch_depth, streaming_partial_batch, streaming_min_groups, vllm_sync_interval) and tightens loss application to only use importance_sampling_ratio when present.
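The async prefetch pattern behind `async_prefetch`/`prefetch_depth` can be sketched with a background producer thread and a bounded queue: the producer generates rollout batches ahead of the trainer and blocks once `prefetch_depth` batches are buffered. This is a generic illustration under assumed names (`prefetch`, `make_batch`), not the `GRPODataProducer` implementation.

```python
import queue
import threading
from typing import Callable, Iterator

_SENTINEL = object()  # marks end of production


def prefetch(
    make_batch: Callable[[int], object],
    num_batches: int,
    prefetch_depth: int = 2,
) -> Iterator[object]:
    """Produce rollout batches on a background thread, keeping at most
    `prefetch_depth` finished batches buffered ahead of the consumer
    (sketch of the bounded-prefetch idea, not the TRL code)."""
    buf: "queue.Queue[object]" = queue.Queue(maxsize=prefetch_depth)

    def producer() -> None:
        for step in range(num_batches):
            buf.put(make_batch(step))  # blocks when the buffer is full
        buf.put(_SENTINEL)

    threading.Thread(target=producer, daemon=True).start()
    while (item := buf.get()) is not _SENTINEL:
        yield item


# Usage: batches arrive in order even though they are built off-thread.
out = list(prefetch(lambda s: {"step": s}, num_batches=4, prefetch_depth=2))
```

The bounded queue is what keeps staleness in check: the producer can run at most `prefetch_depth` batches ahead of the policy being trained.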

Written by Cursor Bugbot for commit 91388df.


@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.



```python
# ---- Shuffle (deferred from produce()) ----
shuffled = shuffle_sequence_dict(data)
dataset._data = shuffled
```

Deferred scoring path missing generation and completion metrics

Medium Severity

_compute_deferred_scores logs reward metrics but omits all generation metrics: num_input_tokens_seen, completions/mean_length, completions/clipped_ratio, completions/mean_terminated_length, IS sampling stats, and self._logs for completion logging. The _generate method skips these on background threads, and the streaming path _compute_streaming_group_scores logs them in its is_last_chunk block — but the non-streaming deferred path never recovers them.
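To make the omission concrete, the completion metrics named above can be recomputed from the completion lengths alone. This is a hedged sketch: the metric keys (`completions/mean_length`, etc.) come from the bot comment, but the helper name `completion_metrics` and its signature are assumptions, not TRL API.

```python
from typing import Dict, List


def completion_metrics(
    completion_lengths: List[int],
    max_completion_length: int,
    eos_terminated: List[bool],
) -> Dict[str, float]:
    """Recompute the generation metrics the deferred path drops (sketch):
    mean completion length, fraction clipped at the length limit, and
    mean length of EOS-terminated completions."""
    n = len(completion_lengths)
    clipped = [length >= max_completion_length for length in completion_lengths]
    terminated = [
        length for length, done in zip(completion_lengths, eos_terminated) if done
    ]
    return {
        "completions/mean_length": sum(completion_lengths) / n,
        "completions/clipped_ratio": sum(clipped) / n,
        "completions/mean_terminated_length": (
            sum(terminated) / len(terminated) if terminated else 0.0
        ),
    }


# Usage: two completions hit the 32-token cap, two ended with EOS.
m = completion_metrics(
    [10, 20, 32, 32],
    max_completion_length=32,
    eos_terminated=[True, True, False, False],
)
```

A fix along these lines would need to run in `_compute_deferred_scores` (or be forwarded from the background thread), mirroring what the streaming path already does in its `is_last_chunk` block.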


