[BREAKING] Refactor Scheduler and GRPOTrainer for Flexible Multi-Turn Training #5307
base: main
Conversation
Summary of Changes
Hello @hjh0119, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
I've undertaken a significant refactoring of the `Scheduler` and `GRPOTrainer` components to introduce greater flexibility and robustness for multi-turn training scenarios. The core intent is to enable more sophisticated reinforcement learning from human feedback (RLHF) algorithms that require iterative interactions. This involves a new, standardized `RolloutOutput` protocol for inference results and a redesigned multi-turn scheduling system. These changes streamline the data flow, improve compatibility with advanced training techniques, and enhance the overall extensibility of the framework.
Highlights
- **Refactored Scheduler Class for Enhanced Flexibility:** The `Scheduler` class has been refactored and repositioned above the rollout engine layer. This change provides significantly more flexibility for implementing complex multi-turn inference logic, allowing for dynamic numbers of rollout results and supporting more versatile multi-round training algorithms. It now standardizes on a new `RolloutOutput` class for inference results, improving extensibility for various environment return values.
- **Updated GRPOTrainer for Compatibility and Efficiency:** The `GRPOTrainer` has been updated to be fully compatible with the refactored `Scheduler` class. It now consistently uses the new `RolloutOutput` for handling and parsing rollout results. A key improvement is the default use of rollout token IDs during training, which avoids repeated encoding and streamlines the training process. The trainer also supports variable numbers of rollout results, aligning with the new multi-turn training algorithms, and features improved code readability through finer method segmentation and more detailed docstrings.
- **Standardized Rollout Output with New Protocol:** A new `RolloutOutput` Pydantic model has been introduced as a standardized protocol for representing rollout inference results. This model encapsulates the `ChatCompletionResponse`, conversation `messages`, `response_token_ids`, `response_loss_mask`, and additional `rollout_infos`. This structured approach replaces the previous ad-hoc response choices (`RolloutResponseChoice`, `GymRolloutResponseChoice`), ensuring better data integrity, serialization, and extensibility (see the model sketch after this list).
- **Improved Multi-Turn Scheduler Hierarchy:** The multi-turn scheduling mechanism has been re-architected with the introduction of `RolloutScheduler` as a base class for single-turn rollouts, and `MultiTurnScheduler` inheriting from it for multi-turn-specific logic. This allows for a clearer separation of concerns and provides flexible customization points through the `step` and `check_finished` methods, enabling developers to define custom turn transitions and termination conditions (see the subclass sketch below).
- **Direct Use of Rollout Token IDs in Training:** The training process within `GRPOTrainer` now directly utilizes token IDs from the `RolloutOutput` for loss computation. This change avoids redundant tokenization during training and enables fine-grained control over loss masking for specific tokens, which is particularly useful in complex multi-turn scenarios where certain parts of the generated response need to be excluded from the loss calculation (see the loss-masking sketch below).
Code Review
This pull request introduces a significant and beneficial refactoring of the `Scheduler` and `GRPOTrainer` to support more flexible multi-turn training. The introduction of the `RolloutScheduler` and `RolloutOutput` classes greatly improves modularity and extensibility. The code is now cleaner and easier to follow. However, I've identified several critical issues, including a `NameError`, incorrect return types, and improper data parsing, that need to be addressed before this can be merged. Please see my detailed comments below.
- Refactor Scheduler Class
- Refactor GRPOTrainer