
[BREAKING] Refactor Scheduler and GRPOTrainer for Flexible Multi-Turn Training #5307


Draft · wants to merge 29 commits into base: main

Conversation

@hjh0119 (Collaborator) commented Aug 8, 2025

Refactor Scheduler Class

  • The Scheduler class is now positioned above the rollout engine layer, offering greater flexibility for multi-turn inference logic.
  • Supports dynamic numbers of rollout results, making it easier to develop more versatile multi-round training algorithms.
  • Adopts a more standardized RolloutOutput class for rollout inference results, which improves extensibility for additional environment return values.
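The RolloutOutput protocol mentioned above can be pictured as follows. The PR itself defines it as a Pydantic model; this is a dependency-free dataclass sketch whose field names follow the PR description (exact types and defaults in ms-swift may differ).

```python
# Sketch of the RolloutOutput protocol described above.
# The real class in ms-swift is a Pydantic model; field names here
# follow the PR description, exact types may differ.
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class RolloutOutput:
    # Raw engine result; the PR wraps a ChatCompletionResponse,
    # modeled as a plain dict here for brevity.
    response: Dict[str, Any]
    # Conversation so far, as OpenAI-style message dicts.
    messages: List[Dict[str, str]] = field(default_factory=list)
    # Token IDs of the generated response, reused at training time
    # so the trainer does not re-encode the text.
    response_token_ids: List[int] = field(default_factory=list)
    # Per-token mask: 1 = include in loss, 0 = exclude.
    response_loss_mask: List[int] = field(default_factory=list)
    # Extra environment/reward information returned by the rollout.
    rollout_infos: Dict[str, Any] = field(default_factory=dict)


out = RolloutOutput(
    response={"id": "cmpl-1"},
    response_token_ids=[5, 6, 7],
    response_loss_mask=[1, 1, 0],
)
```

Because every rollout carries its own token IDs and loss mask, downstream consumers no longer need to know which engine or environment produced the result.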

Refactor GRPOTrainer

  • Updated to be compatible with the refactored Scheduler class, and uses RolloutOutput for rollout results and parsing.
  • Training now uses rollout token IDs by default, avoiding repeated encoding.
  • Supports variable numbers of rollout results (compatible with multi-turn training algorithms).
  • Improves code readability: methods are more finely segmented and more detailed docstrings are provided.
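The interaction between rollout token IDs and the loss mask can be illustrated with a toy helper. The function name and shapes here are hypothetical, purely for illustration; the actual loss computation lives inside GRPOTrainer in ms-swift.

```python
# Hypothetical illustration of per-token loss masking when the trainer
# consumes rollout token IDs directly. Names are made up for this sketch;
# the real logic is part of GRPOTrainer in ms-swift.
from typing import List


def masked_token_loss(per_token_loss: List[float], loss_mask: List[int]) -> float:
    """Average loss over tokens whose mask is 1, skipping masked-out
    tokens (e.g. environment observations in a multi-turn trajectory)."""
    assert len(per_token_loss) == len(loss_mask)
    kept = [loss for loss, mask in zip(per_token_loss, loss_mask) if mask]
    # Guard against an all-masked response.
    return sum(kept) / max(len(kept), 1)


# The last two tokens (say, a tool-call result) are excluded from loss.
print(masked_token_loss([0.5, 0.7, 0.2, 0.9, 0.1], [1, 1, 1, 0, 0]))
```

This is what makes multi-turn training tractable: tokens injected by the environment between turns can be zeroed out of the loss without re-tokenizing the conversation.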

@gemini-code-assist bot (Contributor) left a comment

Summary of Changes

Hello @hjh0119, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

I've undertaken a significant refactoring of the Scheduler and GRPOTrainer components to introduce greater flexibility and robustness for multi-turn training scenarios. The core intent is to enable more sophisticated reinforcement learning from human feedback (RLHF) algorithms that require iterative interactions. This involves a new, standardized RolloutOutput protocol for inference results and a redesigned multi-turn scheduling system. These changes streamline the data flow, improve compatibility with advanced training techniques, and enhance the overall extensibility of the framework.

Highlights

  • Refactored Scheduler Class for Enhanced Flexibility: The Scheduler class has been refactored and repositioned above the rollout engine layer. This change provides significantly more flexibility for implementing complex multi-turn inference logic, allowing for dynamic numbers of rollout results and supporting more versatile multi-round training algorithms. It now standardizes on a new RolloutOutput class for inference results, improving extensibility for various environment return values.
  • Updated GRPOTrainer for Compatibility and Efficiency: The GRPOTrainer has been updated to be fully compatible with the refactored Scheduler class. It now consistently uses the new RolloutOutput for handling and parsing rollout results. A key improvement is the default use of rollout token IDs during training, which avoids repeated encoding and streamlines the training process. The trainer also supports variable numbers of rollout results, aligning with the new multi-turn training algorithms, and features improved code readability through finer method segmentation and more detailed docstrings.
  • Standardized Rollout Output with New Protocol: A new RolloutOutput Pydantic model has been introduced as a standardized protocol for representing rollout inference results. This model encapsulates the ChatCompletionResponse, conversation messages, response_token_ids, response_loss_mask, and additional rollout_infos. This structured approach replaces previous ad-hoc response choices (RolloutResponseChoice, GymRolloutResponseChoice), ensuring better data integrity, serialization, and extensibility.
  • Improved Multi-Turn Scheduler Hierarchy: The multi-turn scheduling mechanism has been re-architected with the introduction of RolloutScheduler as a base class for single-turn rollouts, and MultiTurnScheduler inheriting from it for multi-turn specific logic. This allows for clearer separation of concerns and provides flexible customization points through step and check_finished methods, enabling developers to define custom turn transitions and termination conditions.
  • Direct Use of Rollout Token IDs in Training: The training process within GRPOTrainer now directly utilizes token IDs from the RolloutOutput for loss computation. This change is crucial for avoiding redundant tokenization during training and enables fine-grained control over loss masking for specific tokens, which is particularly useful in complex multi-turn scenarios where certain parts of the generated response might need to be excluded from loss calculation.
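The scheduler hierarchy described in the highlights can be sketched as below. The base-class interface is approximated from the PR summary (the real RolloutScheduler/MultiTurnScheduler signatures in ms-swift take engine requests, not bare message lists); only the two customization points named above, step and check_finished, are taken from the source.

```python
# Approximate sketch of the scheduler hierarchy described above.
# Interfaces are simplified for illustration; the real classes in
# ms-swift operate on rollout engine requests/outputs.
from typing import Dict, List


class RolloutScheduler:
    """Base scheduler: a single-turn rollout needs no loop."""

    def __init__(self, max_turns: int = 1):
        self.max_turns = max_turns


class MultiTurnScheduler(RolloutScheduler):
    """Loops step() until check_finished() ends the trajectory."""

    def run(self, messages: List[Dict[str, str]]) -> List[Dict[str, str]]:
        turn = 0
        while not self.check_finished(messages, turn):
            messages = self.step(messages, turn)
            turn += 1
        return messages

    def step(self, messages, turn):
        # Customization point: call the rollout engine, append tool
        # results, mutate the conversation, etc. Stubbed here.
        return messages + [{"role": "assistant", "content": f"turn {turn}"}]

    def check_finished(self, messages, turn) -> bool:
        # Customization point: define custom termination conditions.
        return turn >= self.max_turns


history = MultiTurnScheduler(max_turns=2).run([{"role": "user", "content": "hi"}])
```

Placing the loop in the scheduler rather than the engine is what allows a single trajectory to produce a dynamic number of rollout results.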

@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a significant and beneficial refactoring of the Scheduler and GRPOTrainer to support more flexible multi-turn training. The introduction of the RolloutScheduler and RolloutOutput classes greatly improves modularity and extensibility. The code is now cleaner and easier to follow. However, I've identified several critical issues, including a NameError, incorrect return types, and improper data parsing, that need to be addressed before this can be merged. Please see my detailed comments below.
