Add decoupled evaluation workflow #142
Conversation
Summary of Changes
Hello @lingzhq, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a significant enhancement to the evaluation framework by adding an optional, decoupled evaluation workflow. This allows for more objective and specialized model assessment, particularly for mathematical tasks, by separating the evaluation process from the standard rollout workflow. The changes include a new dedicated math evaluation workflow and a refactored, centralized set of utilities for robust answer parsing and comparison.
Highlights
- Decoupled Evaluation Workflow: Implemented an optional `default_eval_type` configuration field. When specified, the system will use a dedicated evaluation workflow instead of the default rollout workflow, enabling more precise and objective model assessment.
- New `MathEvalWorkflow`: Added a specialized `MathEvalWorkflow` that adheres to the evaluation methodology of Qwen2.5-Math. This workflow is designed to accurately evaluate model performance on standard math benchmarks using a `qwen_boxed` format.
- Consolidated Math Evaluation Utilities: Introduced `math_eval_utils.py`, a new module containing robust functions for answer extraction, parsing, and comparison, adapted from the Qwen2.5-Math evaluation script. This centralizes and enhances the mathematical evaluation logic, moving some functions previously in `eval_utils.py`.
- Configuration and Integration: Updated various configuration files and the `FileReader` to support the new `default_eval_type` and integrate the `MathEvalWorkflow` into the existing system, ensuring it can be dynamically selected for evaluation tasks (a rough sketch of this selection follows the list).
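To make the dynamic selection concrete, here is a minimal sketch of how such a fallback could look. This is not the repository's actual code: the `WORKFLOWS` registry, the `select_eval_workflow` helper, and the `default_workflow_type` key are illustrative assumptions; only `default_eval_type` and the `math_eval_workflow` / `MathEvalWorkflow` names come from the PR itself.

```python
# Illustrative sketch only -- the registry and helper below are hypothetical,
# not the project's real API. It shows the described behaviour: keep the
# rollout workflow unless default_eval_type names a dedicated eval workflow.

WORKFLOWS = {
    "math_eval_workflow": "MathEvalWorkflow",  # eval type -> workflow class (assumed mapping)
}

def select_eval_workflow(config: dict) -> str:
    """Pick the workflow used during the evaluation phase."""
    eval_type = config.get("default_eval_type")
    if eval_type is None:
        # Field omitted: evaluation keeps using the standard rollout workflow.
        return config["default_workflow_type"]
    if eval_type not in WORKFLOWS:
        raise ValueError(f"Unknown default_eval_type: {eval_type!r}")
    return eval_type
```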
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder at the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check it and use code with caution.
Code Review
This pull request introduces a decoupled evaluation workflow, enhancing the objectivity of model assessment. The new `MathEvalWorkflow` and `math_eval_utils.py` are excellent additions. The feedback focuses on minor refinements to improve robustness, code clarity, and developer experience.
hiyuchang left a comment:
lgtm
How would one implement […]? Am I understanding right that the eval workflow's […]? For implementing […]?
Implement a new workflow that inherits the original workflow, whose […]
So in essence, it would return a list with a single response object, right? (Matching a single prompt as input to […].)
Yes, the newly implemented Workflow needs to aggregate the Experience objects generated by k calls and return only one Experience.
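(For illustration, a minimal sketch of such an aggregating workflow, assuming a simple `Experience` dataclass and a base workflow whose `run()` returns a list of `Experience` objects; all class and field names here are hypothetical, not the framework's actual types.)

```python
import random
from dataclasses import dataclass, field
from typing import List

@dataclass
class Experience:
    # Hypothetical stand-in for the framework's Experience type.
    response: str
    reward: float = 0.0
    metrics: dict = field(default_factory=dict)

class BaseRolloutWorkflow:
    """Hypothetical base workflow: one run() call produces one Experience."""

    def run(self) -> List[Experience]:
        response = "..."  # in reality: a model call
        return [Experience(response=response, reward=random.random())]

class AggregatingEvalWorkflow(BaseRolloutWorkflow):
    """Sketch: inherit the original workflow, call it k times, return one Experience."""

    def __init__(self, k: int = 4):
        self.k = k

    def run(self) -> List[Experience]:
        candidates: List[Experience] = []
        for _ in range(self.k):
            candidates.extend(super().run())
        # Aggregate the k Experiences into a single one (here: keep the best reward).
        best = max(candidates, key=lambda exp: exp.reward)
        best.metrics["num_candidates"] = len(candidates)
        return [best]  # a list containing a single aggregated Experience
```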
Does it make sense to extend the […]?
The […]
Hmm, for multi-turn workflows processing several initial rollouts, I'm guessing that being able to send a batch of different prompts to vLLM could sometimes achieve better utilization.
Batching can indeed improve performance, but it reduces the flexibility of workflow writing, which is a trade-off. For scenarios that are more concerned with performance, we recommend using the […]
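(As a rough illustration of the trade-off discussed here, the sketch below contrasts per-prompt calls, which keep each turn flexible, with a single batched call that can use the inference engine more efficiently. The `generate`/`generate_batch` interface is an assumption made for the example, not the project's or vLLM's actual API.)

```python
from typing import List

def rollout_sequential(model, prompts: List[str]) -> List[str]:
    # Flexible: each call can depend on the previous response (e.g. multi-turn),
    # but the engine sees only one prompt at a time.
    return [model.generate(p) for p in prompts]

def rollout_batched(model, prompts: List[str]) -> List[str]:
    # Better utilization: all prompts are submitted together in one request,
    # but every prompt must be known up front, which constrains the workflow.
    return model.generate_batch(prompts)
```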
Description
This PR addresses Issue #119 by introducing an optional `default_eval_type` field. This feature decouples a dedicated evaluation workflow from the standard rollout workflow, enabling a more objective assessment of the model's true capabilities.
Key Features
- Optional `default_eval_type` field in the configuration. If this field is omitted, the evaluation process remains unchanged. When specified, the system switches to a custom evaluation workflow during the evaluation phase.
- `MathEvalWorkflow`: Added an initial `math_eval_workflow` that follows the official implementation of Qwen2.5-Math. It uses the `qwen_boxed` format to objectively and accurately evaluate the model's performance on standard math benchmarks.
- Math evaluation utilities: Created a new `math_eval_utils.py` file by modifying the official Qwen2.5-Math evaluation script. This module provides more powerful and robust functions for answer extraction, parsing, and comparison (a simplified sketch follows this list). It also consolidates some dependencies previously located in the original `eval_utils.py`.
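As a rough, simplified illustration of the kind of utilities this module provides, the sketch below extracts the last `\boxed{...}` answer from a response and performs a naive comparison. The real `math_eval_utils.py` adapted from Qwen2.5-Math is far more robust, and the function names here are placeholders, not the module's actual API.

```python
from typing import Optional

# Simplified placeholders -- the real math_eval_utils.py (adapted from
# Qwen2.5-Math) handles many more answer formats and equivalences.

def extract_boxed_answer(text: str) -> Optional[str]:
    """Return the content of the last \\boxed{...} span in a model response."""
    start = text.rfind("\\boxed{")
    if start == -1:
        return None
    i, depth, out = start + len("\\boxed{"), 1, []
    while i < len(text):
        ch = text[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                break
        out.append(ch)
        i += 1
    return "".join(out).strip()

def naive_math_equal(pred: Optional[str], target: str) -> bool:
    """Very rough comparison: exact string match, then numeric equality."""
    if pred is None:
        return False
    if pred.strip() == target.strip():
        return True
    try:
        return abs(float(pred) - float(target)) < 1e-6
    except ValueError:
        return False

# Example: naive_math_equal(extract_boxed_answer("... the answer is \\boxed{72}."), "72") -> True
```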
Example
The experiment below was conducted by training on GSM8K, with other settings consistent with the Issue #119 description. The two blue lines represent the results without using `default_eval_type`, showing lower scores. The red line, which uses the new `math_eval_workflow`, achieves scores consistent with official benchmarks, better reflecting the model's true performance.
TODO List
- `default_eval_type` should be a required field.
- `eval_utils` compatibility: replace `eval_utils` with the more robust implementation from `math_eval_utils`.
Please check the following items before the code is ready to be reviewed.