
async streaming grpo w prefetch#5250

Open
winglian wants to merge 3 commits into huggingface:main from winglian:async-streaming-grpo

Conversation

Contributor

@winglian commented Mar 9, 2026

What does this PR do?

Companion PR to huggingface/transformers#44259.

We may need to require transformers > v5.3.0, depending on when the companion transformers PR is merged.

[Screenshot: 2026-03-09 at 12:56:41 PM]

Alongside #5249, this yields significant speedups.


Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.


Note

High Risk
Refactors GRPO’s core online rollout/training flow to use transformers DataProducer with background-thread prefetch and optional streaming scoring, which is concurrency- and training-correctness sensitive (staleness, synchronization, metrics/state updates). Changes also affect vLLM weight syncing and importance-sampling application paths.

Overview
Adds an optional transformers DataProducer-based rollout pipeline for GRPOTrainer, enabling async rollout prefetching and an optional streaming partial-batch mode that incrementally scores prompt groups and yields micro-batches to overlap reward computation with policy logprob scoring.
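The streaming partial-batch mode described above can be sketched as a generator that accumulates scored prompt groups and yields a micro-batch as soon as a minimum number of groups is available, rather than waiting for the full rollout batch. This is a minimal illustration, not the PR's implementation: the function name `stream_micro_batches` and the group/sample shapes are assumptions; only the `streaming_min_groups` parameter name comes from the PR description.

```python
from typing import Iterable, Iterator, List


def stream_micro_batches(
    scored_groups: Iterable[List[dict]],
    streaming_min_groups: int = 2,
) -> Iterator[List[dict]]:
    """Yield micro-batches of scored prompt groups as soon as enough
    have accumulated, overlapping reward computation with downstream
    policy-logprob scoring (sketch, not the actual TRL code)."""
    pending: List[List[dict]] = []
    for group in scored_groups:
        pending.append(group)
        if len(pending) >= streaming_min_groups:
            # Flatten the buffered groups into one micro-batch.
            yield [sample for g in pending for sample in g]
            pending = []
    if pending:
        # Flush whatever remains as a final partial chunk.
        yield [sample for g in pending for sample in g]


# Usage: 5 groups of 4 samples each arrive one at a time.
groups = [[{"prompt": i, "reward": 0.0} for _ in range(4)] for i in range(5)]
batches = list(stream_micro_batches(groups, streaming_min_groups=2))
```

With 5 groups and `streaming_min_groups=2`, this yields two full micro-batches of 8 samples plus a final partial chunk of 4.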

Introduces GRPODataProducer/RolloutDataset and updates GRPOTrainer to create/wire the producer, handle deferred scoring when rollouts are produced on a background thread (including vLLM sync scheduling via vllm_sync_interval), and bypass the legacy _prepare_inputs buffering path when the producer is enabled. Adds new config flags (use_data_producer, async_prefetch, prefetch_depth, streaming_partial_batch, streaming_min_groups, vllm_sync_interval) and tightens loss application to only use importance_sampling_ratio when present.
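The async prefetch pattern behind `async_prefetch`/`prefetch_depth` can be sketched with a background producer thread and a bounded queue: the producer generates rollout batches ahead of the trainer and blocks once `prefetch_depth` batches are buffered. This is a generic illustration under assumed names (`prefetch`, `make_batch`), not the `GRPODataProducer` implementation.

```python
import queue
import threading
from typing import Callable, Iterator

_SENTINEL = object()  # marks end of production


def prefetch(
    make_batch: Callable[[int], object],
    num_batches: int,
    prefetch_depth: int = 2,
) -> Iterator[object]:
    """Produce rollout batches on a background thread, keeping at most
    `prefetch_depth` finished batches buffered ahead of the consumer
    (sketch of the bounded-prefetch idea, not the TRL code)."""
    buf: "queue.Queue[object]" = queue.Queue(maxsize=prefetch_depth)

    def producer() -> None:
        for step in range(num_batches):
            buf.put(make_batch(step))  # blocks when the buffer is full
        buf.put(_SENTINEL)

    threading.Thread(target=producer, daemon=True).start()
    while (item := buf.get()) is not _SENTINEL:
        yield item


# Usage: batches arrive in order even though they are built off-thread.
out = list(prefetch(lambda s: {"step": s}, num_batches=4, prefetch_depth=2))
```

The bounded queue is what keeps staleness in check: the producer can run at most `prefetch_depth` batches ahead of the policy being trained.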

Written by Cursor Bugbot for commit 91388df.


@cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.



```python
# ---- Shuffle (deferred from produce()) ----
shuffled = shuffle_sequence_dict(data)
dataset._data = shuffled
```

Deferred scoring path missing generation and completion metrics

Medium Severity

_compute_deferred_scores logs reward metrics but omits all generation metrics: num_input_tokens_seen, completions/mean_length, completions/clipped_ratio, completions/mean_terminated_length, IS sampling stats, and self._logs for completion logging. The _generate method skips these on background threads, and the streaming path _compute_streaming_group_scores logs them in its is_last_chunk block — but the non-streaming deferred path never recovers them.
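To make the omission concrete, the completion metrics named above can be recomputed from the completion lengths alone. This is a hedged sketch: the metric keys (`completions/mean_length`, etc.) come from the bot comment, but the helper name `completion_metrics` and its signature are assumptions, not TRL API.

```python
from typing import Dict, List


def completion_metrics(
    completion_lengths: List[int],
    max_completion_length: int,
    eos_terminated: List[bool],
) -> Dict[str, float]:
    """Recompute the generation metrics the deferred path drops (sketch):
    mean completion length, fraction clipped at the length limit, and
    mean length of EOS-terminated completions."""
    n = len(completion_lengths)
    clipped = [length >= max_completion_length for length in completion_lengths]
    terminated = [
        length for length, done in zip(completion_lengths, eos_terminated) if done
    ]
    return {
        "completions/mean_length": sum(completion_lengths) / n,
        "completions/clipped_ratio": sum(clipped) / n,
        "completions/mean_terminated_length": (
            sum(terminated) / len(terminated) if terminated else 0.0
        ),
    }


# Usage: two completions hit the 32-token cap, two ended with EOS.
m = completion_metrics(
    [10, 20, 32, 32],
    max_completion_length=32,
    eos_terminated=[True, True, False, False],
)
```

A fix along these lines would need to run in `_compute_deferred_scores` (or be forwarded from the background thread), mirroring what the streaming path already does in its `is_last_chunk` block.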


