
[BREAKING] Refactor Scheduler and GRPOTrainer for Flexible Multi-Turn Training #5307


Draft · wants to merge 29 commits into base: main

Conversation

@hjh0119 (Collaborator) commented Aug 8, 2025

Refactor Scheduler Class

  • The Scheduler class is now positioned above the rollout engine layer, offering greater flexibility for multi-turn inference logic.
  • Supports dynamic numbers of rollout results, making it easier to develop more versatile multi-round training algorithms.
  • Adopts a more standardized RolloutOutput class for rollout inference results, which improves extensibility for additional environment return values.
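The RolloutOutput protocol mentioned above can be pictured as follows. The PR itself defines it as a Pydantic model; this is a dependency-free dataclass sketch whose field names follow the PR description (exact types and defaults in ms-swift may differ).

```python
# Sketch of the RolloutOutput protocol described above.
# The real class in ms-swift is a Pydantic model; field names here
# follow the PR description, exact types may differ.
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class RolloutOutput:
    # Raw engine result; the PR wraps a ChatCompletionResponse,
    # modeled as a plain dict here for brevity.
    response: Dict[str, Any]
    # Conversation so far, as OpenAI-style message dicts.
    messages: List[Dict[str, str]] = field(default_factory=list)
    # Token IDs of the generated response, reused at training time
    # so the trainer does not re-encode the text.
    response_token_ids: List[int] = field(default_factory=list)
    # Per-token mask: 1 = include in loss, 0 = exclude.
    response_loss_mask: List[int] = field(default_factory=list)
    # Extra environment/reward information returned by the rollout.
    rollout_infos: Dict[str, Any] = field(default_factory=dict)


out = RolloutOutput(
    response={"id": "cmpl-1"},
    response_token_ids=[5, 6, 7],
    response_loss_mask=[1, 1, 0],
)
```

Because every rollout carries its own token IDs and loss mask, downstream consumers no longer need to know which engine or environment produced the result.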

Refactor GRPOTrainer

  • Updated to be compatible with the refactored Scheduler class, and uses RolloutOutput for rollout results and parsing.
  • Training now uses rollout token IDs by default, avoiding repeated encoding.
  • Supports variable numbers of rollout results (compatible with multi-turn training algorithms).
  • Improves code readability: methods are more finely segmented and more detailed docstrings are provided.
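The interaction between rollout token IDs and the loss mask can be illustrated with a toy helper. The function name and shapes here are hypothetical, purely for illustration; the actual loss computation lives inside GRPOTrainer in ms-swift.

```python
# Hypothetical illustration of per-token loss masking when the trainer
# consumes rollout token IDs directly. Names are made up for this sketch;
# the real logic is part of GRPOTrainer in ms-swift.
from typing import List


def masked_token_loss(per_token_loss: List[float], loss_mask: List[int]) -> float:
    """Average loss over tokens whose mask is 1, skipping masked-out
    tokens (e.g. environment observations in a multi-turn trajectory)."""
    assert len(per_token_loss) == len(loss_mask)
    kept = [loss for loss, mask in zip(per_token_loss, loss_mask) if mask]
    # Guard against an all-masked response.
    return sum(kept) / max(len(kept), 1)


# The last two tokens (say, a tool-call result) are excluded from loss.
print(masked_token_loss([0.5, 0.7, 0.2, 0.9, 0.1], [1, 1, 1, 0, 0]))
```

This is what makes multi-turn training tractable: tokens injected by the environment between turns can be zeroed out of the loss without re-tokenizing the conversation.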

@gemini-code-assist bot (Contributor) left a comment

Summary of Changes

Hello @hjh0119, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I've undertaken a significant refactoring of the Scheduler and GRPOTrainer components to introduce greater flexibility and robustness for multi-turn training scenarios. The core intent is to enable more sophisticated reinforcement learning from human feedback (RLHF) algorithms that require iterative interactions. This involves a new, standardized RolloutOutput protocol for inference results and a redesigned multi-turn scheduling system. These changes streamline the data flow, improve compatibility with advanced training techniques, and enhance the overall extensibility of the framework.

Highlights

  • Refactored Scheduler Class for Enhanced Flexibility: The Scheduler class has been refactored and repositioned above the rollout engine layer. This change provides significantly more flexibility for implementing complex multi-turn inference logic, allowing for dynamic numbers of rollout results and supporting more versatile multi-round training algorithms. It now standardizes on a new RolloutOutput class for inference results, improving extensibility for various environment return values.
  • Updated GRPOTrainer for Compatibility and Efficiency: The GRPOTrainer has been updated to be fully compatible with the refactored Scheduler class. It now consistently uses the new RolloutOutput for handling and parsing rollout results. A key improvement is the default use of rollout token IDs during training, which avoids repeated encoding and streamlines the training process. The trainer also supports variable numbers of rollout results, aligning with the new multi-turn training algorithms, and features improved code readability through finer method segmentation and more detailed docstrings.
  • Standardized Rollout Output with New Protocol: A new RolloutOutput Pydantic model has been introduced as a standardized protocol for representing rollout inference results. This model encapsulates the ChatCompletionResponse, conversation messages, response_token_ids, response_loss_mask, and additional rollout_infos. This structured approach replaces previous ad-hoc response choices (RolloutResponseChoice, GymRolloutResponseChoice), ensuring better data integrity, serialization, and extensibility.
  • Improved Multi-Turn Scheduler Hierarchy: The multi-turn scheduling mechanism has been re-architected with the introduction of RolloutScheduler as a base class for single-turn rollouts, and MultiTurnScheduler inheriting from it for multi-turn specific logic. This allows for clearer separation of concerns and provides flexible customization points through step and check_finished methods, enabling developers to define custom turn transitions and termination conditions.
  • Direct Use of Rollout Token IDs in Training: The training process within GRPOTrainer now directly utilizes token IDs from the RolloutOutput for loss computation. This change is crucial for avoiding redundant tokenization during training and enables fine-grained control over loss masking for specific tokens, which is particularly useful in complex multi-turn scenarios where certain parts of the generated response might need to be excluded from loss calculation.
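The scheduler hierarchy described in the highlights can be sketched as below. The base-class interface is approximated from the PR summary (the real RolloutScheduler/MultiTurnScheduler signatures in ms-swift take engine requests, not bare message lists); only the two customization points named above, step and check_finished, are taken from the source.

```python
# Approximate sketch of the scheduler hierarchy described above.
# Interfaces are simplified for illustration; the real classes in
# ms-swift operate on rollout engine requests/outputs.
from typing import Dict, List


class RolloutScheduler:
    """Base scheduler: a single-turn rollout needs no loop."""

    def __init__(self, max_turns: int = 1):
        self.max_turns = max_turns


class MultiTurnScheduler(RolloutScheduler):
    """Loops step() until check_finished() ends the trajectory."""

    def run(self, messages: List[Dict[str, str]]) -> List[Dict[str, str]]:
        turn = 0
        while not self.check_finished(messages, turn):
            messages = self.step(messages, turn)
            turn += 1
        return messages

    def step(self, messages, turn):
        # Customization point: call the rollout engine, append tool
        # results, mutate the conversation, etc. Stubbed here.
        return messages + [{"role": "assistant", "content": f"turn {turn}"}]

    def check_finished(self, messages, turn) -> bool:
        # Customization point: define custom termination conditions.
        return turn >= self.max_turns


history = MultiTurnScheduler(max_turns=2).run([{"role": "user", "content": "hi"}])
```

Placing the loop in the scheduler rather than the engine is what allows a single trajectory to produce a dynamic number of rollout results.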

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a significant and beneficial refactoring of the Scheduler and GRPOTrainer to support more flexible multi-turn training. The introduction of the RolloutScheduler and RolloutOutput classes greatly improves modularity and extensibility. The code is now cleaner and easier to follow. However, I've identified several critical issues, including a NameError, incorrect return types, and improper data parsing, that need to be addressed before this can be merged. Please see my detailed comments below.
