
Conversation

@lingzhq
Collaborator

@lingzhq lingzhq commented Jul 24, 2025

Description

This PR addresses Issue #119 by introducing an optional default_eval_type field. It decouples evaluation from the standard rollout workflow by adding a dedicated evaluation workflow, enabling a more objective assessment of the model's true capabilities.

Key Features

  • Decoupled Evaluation Workflow: Introduced default_eval_type in the configuration. If this field is omitted, the evaluation process remains unchanged. When specified, the system switches to a custom evaluation workflow during the evaluation phase.
  • New MathEvalWorkflow: Added an initial math_eval_workflow that follows the official implementation of Qwen2.5-Math. It uses the qwen_boxed format to objectively and accurately evaluate the model's performance on standard math benchmarks.
  • Enhanced math_eval_utils.py: Created a new math_eval_utils.py file by adapting the official Qwen2.5-Math evaluation script. This module provides more robust functions for answer extraction, parsing, and comparison, and consolidates some dependencies previously located in the original eval_utils.py. (An illustrative sketch of this extraction-and-comparison style follows this list.)
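
To make the qwen_boxed evaluation style concrete, here is a minimal, hypothetical Python sketch of the kind of answer extraction and comparison math_eval_utils provides. The function names, the prompt string, and the comparison logic below are this sketch's own simplifications, not the module's actual API; the real utilities handle many more answer formats and also compare symbolically.

```python
import re

# The qwen_boxed style asks the model to reason step by step and put the final
# answer in \boxed{...}; the exact instruction string here is illustrative.
QWEN_BOXED_INSTRUCTION = (
    "Please reason step by step, and put your final answer within \\boxed{}."
)


def extract_boxed_answer(completion: str) -> str | None:
    """Return the contents of the last \\boxed{...} span, tracking nested braces."""
    start = completion.rfind("\\boxed{")
    if start == -1:
        return None
    depth, chars = 1, []
    for ch in completion[start + len("\\boxed{"):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # unbalanced braces


def loose_equal(pred: str, ref: str) -> bool:
    """Toy string-level comparison; the real utilities also do symbolic checks."""
    normalize = lambda s: re.sub(r"\s+|\\left|\\right|\\!|,", "", s).rstrip(".").lower()
    return normalize(pred) == normalize(ref)


# extract_boxed_answer("... so the answer is \\boxed{\\frac{1}{2}}.") -> "\\frac{1}{2}"
# loose_equal("\\frac{1}{2}", "\\frac{ 1 }{ 2 }") -> True
```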

Example

The experiment below was conducted by training on GSM8K, with all other settings consistent with the Issue #119 description. The two blue lines show the results without default_eval_type, which score lower. The red line, which uses the new math_eval_workflow, achieves scores consistent with the official benchmarks and better reflects the model's true performance.

(Figure: eval-workflow-res — evaluation score curves for the three runs described above)

TODO List

  • Enhance eval_workflow Customization:
    • Allow custom evaluation prompts to align with rollout formats or support other evaluation languages.
    • Enable customization of generation configs, which are currently set to recommended official values.
    • Design and implement Pass@k evaluation, a metric of growing interest in RFT (a reference estimator is sketched after this list).
    • Add an option to limit the evaluation set size for faster feedback.
  • Improve Workflow Integration:
    • Consider whether default_eval_type should be a required field.
    • Investigate how to retain additional evaluation metrics, such as format scores.
    • Extend support to workflows beyond mathematics.
  • eval_utils Compatibility:
    • Explore replacing the original answer comparison logic in eval_utils with the more robust implementation from math_eval_utils.
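
For reference on the Pass@k TODO above (not part of this PR), the standard unbiased estimator from the Codex paper (Chen et al., 2021) is straightforward to compute once n completions per problem are sampled and c of them are correct:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k).

    n: number of sampled completions per problem, c: number of correct ones.
    """
    if n - c < k:  # fewer than k incorrect samples: any k-subset contains a correct one
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 16 samples, 3 correct, k = 4  ->  pass_at_k(16, 3, 4) ≈ 0.607
```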

Checklist

Please check the following items before the code is ready to be reviewed.

  • Code has passed all tests
  • Docstrings have been added/updated in Google Style
  • Documentation has been updated
  • Code is ready for review

@lingzhq lingzhq requested review from pan-x-c and yxdyc July 24, 2025 10:13
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @lingzhq, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant enhancement to the evaluation framework by adding an optional, decoupled evaluation workflow. This allows for more objective and specialized model assessment, particularly for mathematical tasks, by separating the evaluation process from the standard rollout workflow. The changes include a new dedicated math evaluation workflow and a refactored, centralized set of utilities for robust answer parsing and comparison.

Highlights

  • Decoupled Evaluation Workflow: Implemented an optional default_eval_type configuration field. When specified, the system will use a dedicated evaluation workflow instead of the default rollout workflow, enabling more precise and objective model assessment.
  • New MathEvalWorkflow: Added a specialized MathEvalWorkflow that adheres to the evaluation methodology of Qwen2.5-Math. This workflow is designed to accurately evaluate model performance on standard math benchmarks using a qwen_boxed format.
  • Consolidated Math Evaluation Utilities: Introduced math_eval_utils.py, a new module containing robust functions for answer extraction, parsing, and comparison, adapted from the Qwen2.5-Math evaluation script. This centralizes and enhances the mathematical evaluation logic, moving some functions previously in eval_utils.py.
  • Configuration and Integration: Updated various configuration files and the FileReader to support the new default_eval_type and integrate the MathEvalWorkflow into the existing system, ensuring it can be dynamically selected for evaluation tasks.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a decoupled evaluation workflow, enhancing the objectivity of model assessment. The new MathEvalWorkflow and math_eval_utils.py are excellent additions. The feedback focuses on minor refinements to enhance robustness, code clarity, and developer experience.

@lingzhq lingzhq force-pushed the dev/eval_workflow branch from 31d834e to e12f4b7 on July 29, 2025 at 13:35
@yxdyc yxdyc requested a review from chenyushuo July 30, 2025 06:43
@pan-x-c
Collaborator

pan-x-c commented Aug 1, 2025

/unittest-module-explorer

@github-actions

github-actions bot commented Aug 1, 2025

Summary

Tests 📝: 19 · Passed ✅: 19 · Failed ❌: 0 · Skipped ⏭️: 0 · Other ❓: 0 · Flaky 🍂: 0 · Duration ⏱️: 329ms

Tests

Test Name · Duration (all 19 passed, none flaky)
tests/explorer/explorer_test.py::BaseExplorerCase::test_explorer 1ms
tests/explorer/explorer_test.py::TestExplorerCountdownEval::test_explorer 94ms
tests/explorer/explorer_test.py::TestExplorerCountdownNoEval::test_explorer 96ms
tests/explorer/explorer_test.py::TestExplorerWithAddStrategy::test_explorer 52ms
tests/explorer/scheduler_test.py::SchedulerTest::test_concurrent_operations 4ms
tests/explorer/scheduler_test.py::SchedulerTest::test_get_results 20ms
tests/explorer/scheduler_test.py::SchedulerTest::test_scheduler_all_methods 14ms
tests/explorer/scheduler_test.py::SchedulerTest::test_scheduler_restart_after_stop 8ms
tests/explorer/scheduler_test.py::SchedulerTest::test_split_tasks 8ms
tests/explorer/scheduler_test.py::SchedulerTest::test_wait_all 8ms
tests/explorer/scheduler_test.py::SchedulerTest::test_wait_all_timeout_with_multi_batch 13ms
tests/explorer/workflow_test.py::WorkflowTest::test_gsm8k_workflow 1ms
tests/explorer/workflow_test.py::WorkflowTest::test_math_boxed_workflow 1ms
tests/explorer/workflow_test.py::WorkflowTest::test_math_complex_workflow 1ms
tests/explorer/workflow_test.py::WorkflowTest::test_math_eval_workflow 1ms
tests/explorer/workflow_test.py::WorkflowTest::test_math_fraction_workflow 1ms
tests/explorer/workflow_test.py::WorkflowTest::test_math_workflow 1ms
tests/explorer/workflow_test.py::WorkflowTest::test_rm_gallery_workflow 1ms
tests/explorer/workflow_test.py::WorkflowTest::test_workflow_resettable 1ms

Github Test Reporter by CTRF 💚

@lingzhq lingzhq force-pushed the dev/eval_workflow branch from e12f4b7 to 67f8cd1 on August 1, 2025 at 07:18
@hiyuchang
Collaborator

/unittest-module-explorer

1 similar comment
@chenyushuo
Collaborator

/unittest-module-explorer

@github-actions

github-actions bot commented Aug 1, 2025

Summary

Tests 📝: 20 · Passed ✅: 20 · Failed ❌: 0 · Skipped ⏭️: 0 · Other ❓: 0 · Flaky 🍂: 0 · Duration ⏱️: 347ms

Tests

Test Name · Duration (all 20 passed, none flaky)
tests/explorer/explorer_test.py::BaseExplorerCase::test_explorer 1ms
tests/explorer/explorer_test.py::TestExplorerCountdownEval::test_explorer 105ms
tests/explorer/explorer_test.py::TestExplorerCountdownNoEval::test_explorer 96ms
tests/explorer/explorer_test.py::TestExplorerWithAddStrategy::test_explorer 52ms
tests/explorer/scheduler_test.py::SchedulerTest::test_concurrent_operations 4ms
tests/explorer/scheduler_test.py::SchedulerTest::test_get_results 20ms
tests/explorer/scheduler_test.py::SchedulerTest::test_multi_step_execution 4ms
tests/explorer/scheduler_test.py::SchedulerTest::test_scheduler_all_methods 14ms
tests/explorer/scheduler_test.py::SchedulerTest::test_scheduler_restart_after_stop 8ms
tests/explorer/scheduler_test.py::SchedulerTest::test_split_tasks 8ms
tests/explorer/scheduler_test.py::SchedulerTest::test_wait_all 7ms
tests/explorer/scheduler_test.py::SchedulerTest::test_wait_all_timeout_with_multi_batch 13ms
tests/explorer/workflow_test.py::WorkflowTest::test_gsm8k_workflow 1ms
tests/explorer/workflow_test.py::WorkflowTest::test_math_boxed_workflow 1ms
tests/explorer/workflow_test.py::WorkflowTest::test_math_complex_workflow 1ms
tests/explorer/workflow_test.py::WorkflowTest::test_math_eval_workflow 1ms
tests/explorer/workflow_test.py::WorkflowTest::test_math_fraction_workflow 1ms
tests/explorer/workflow_test.py::WorkflowTest::test_math_workflow 1ms
tests/explorer/workflow_test.py::WorkflowTest::test_rm_gallery_workflow 1ms
tests/explorer/workflow_test.py::WorkflowTest::test_workflow_resettable 1ms

Github Test Reporter by CTRF 💚

Collaborator

@hiyuchang hiyuchang left a comment


lgtm

@hiyuchang hiyuchang merged commit 9fc6e47 into agentscope-ai:main Aug 1, 2025
1 check passed
@vadimkantorov
Contributor

  • Design and implement Pass@k evaluation, a metric of growing interest in RFT.

How would one implement Pass@k in the current Trinity design?

Am I understanding right that the eval workflow's run() still returns a list of multiple responses for a given input prompt?

For implementing Pass@k aggregation, I'd say that it should return a single response for a given input prompt? (and maybe still preserve the K responses inside of it)

@pan-x-c
Collaborator

pan-x-c commented Sep 11, 2025

Implement a new workflow that inherits from the original workflow; its run method executes the parent class's run method k times and collects the return values.

@vadimkantorov
Contributor

vadimkantorov commented Sep 11, 2025

So in essence - it would return a list with a single response object, right? (matching a single prompt as input to run())

@pan-x-c
Collaborator

pan-x-c commented Sep 11, 2025

Yes, the newly implemented Workflow needs to aggregate the Experiences generated by the k calls and return only one Experience.
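
As a rough sketch of that shape (the Experience and workflow classes below are stand-ins, not the actual Trinity-RFT API), a Pass@k wrapper workflow might look like this:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Experience:  # stand-in for Trinity's Experience type
    response: str
    correct: bool
    metrics: dict = field(default_factory=dict)


class MathEvalWorkflow:  # stand-in for the eval workflow added in this PR
    def run(self) -> List[Experience]:
        # the real implementation performs one generation plus qwen_boxed scoring
        return [Experience(response="\\boxed{42}", correct=True)]


class PassAtKEvalWorkflow(MathEvalWorkflow):
    """Calls the parent run() k times and folds the results into one Experience."""

    def __init__(self, k: int = 8):
        self.k = k

    def run(self) -> List[Experience]:
        samples: List[Experience] = []
        for _ in range(self.k):
            samples.extend(super().run())  # k independent generations
        solved = any(exp.correct for exp in samples)  # pass@k with exactly k samples
        merged = Experience(
            response=samples[0].response,
            correct=solved,
            metrics={"pass@k": float(solved), "k": self.k},
        )
        return [merged]  # a single aggregated Experience, as described above
```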

@vadimkantorov
Contributor

Does it make sense to extend the InferenceModel interface to allow passing a batch of prompts to chat()?

@pan-x-c
Collaborator

pan-x-c commented Sep 11, 2025

The chat interface is used to simulate openai.chat.completions, which only allows one message list. If you want to pass a batch of prompts, you can consider manually applying the chat_template to the message list and then calling the generate method.
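
For illustration, here is what that could look like with the chat template applied by hand and a single batched generate call. vLLM is used directly below for concreteness; Trinity's own generate signature is not shown in this thread, so treat the call shape as an assumption and check it against the docs.

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "Qwen/Qwen2.5-Math-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name)

questions = ["1 + 1 = ?", "What is 15% of 240?", "Solve x^2 = 9."]

# Apply the chat template manually to each message list ...
prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": q}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for q in questions
]

# ... then send the whole batch in one generate call instead of several chat() calls.
outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=512))
for out in outputs:
    print(out.outputs[0].text)
```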

@vadimkantorov
Contributor

Hmm, for multi-turn workflows that process several initial rollouts, I'm guessing that being able to send a batch of different prompts to vllm could sometimes achieve better utilization.

@pan-x-c
Collaborator

pan-x-c commented Sep 12, 2025

Batching can indeed improve performance, but it reduces the flexibility of workflow writing, which is a trade-off. For scenarios that are more concerned with performance, we recommend using the generate interface, which supports batched prompts and achieves better performance.
