
Conversation

@lingzhq
Collaborator

@lingzhq lingzhq commented Jul 24, 2025

Description

This PR addresses Issue #119 by introducing an optional default_eval_type field. It decouples evaluation from the standard rollout workflow by adding a dedicated evaluation workflow, enabling a more objective assessment of the model's true capabilities.

Key Features

  • Decoupled Evaluation Workflow: Introduced default_eval_type in the configuration. If this field is omitted, the evaluation process remains unchanged. When specified, the system switches to a custom evaluation workflow during the evaluation phase.
  • New MathEvalWorkflow: Added an initial math_eval_workflow that follows the official implementation of Qwen2.5-Math. It uses the qwen_boxed format to objectively and accurately evaluate the model's performance on standard math benchmarks.
  • Enhanced math_eval_utils.py: Created a new math_eval_utils.py file by adapting the official Qwen2.5-Math evaluation script. This module provides more robust functions for answer extraction, parsing, and comparison, and consolidates some dependencies previously located in the original eval_utils.py. (An illustrative sketch of this extraction-and-comparison style follows this list.)
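
To make the qwen_boxed evaluation style concrete, here is a minimal, hypothetical Python sketch of the kind of answer extraction and comparison math_eval_utils provides. The function names, the prompt string, and the comparison logic below are this sketch's own simplifications, not the module's actual API; the real utilities handle many more answer formats and also compare symbolically.

```python
import re

# The qwen_boxed style asks the model to reason step by step and put the final
# answer in \boxed{...}; the exact instruction string here is illustrative.
QWEN_BOXED_INSTRUCTION = (
    "Please reason step by step, and put your final answer within \\boxed{}."
)


def extract_boxed_answer(completion: str) -> str | None:
    """Return the contents of the last \\boxed{...} span, tracking nested braces."""
    start = completion.rfind("\\boxed{")
    if start == -1:
        return None
    depth, chars = 1, []
    for ch in completion[start + len("\\boxed{"):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    return None  # unbalanced braces


def loose_equal(pred: str, ref: str) -> bool:
    """Toy string-level comparison; the real utilities also do symbolic checks."""
    normalize = lambda s: re.sub(r"\s+|\\left|\\right|\\!|,", "", s).rstrip(".").lower()
    return normalize(pred) == normalize(ref)


# extract_boxed_answer("... so the answer is \\boxed{\\frac{1}{2}}.") -> "\\frac{1}{2}"
# loose_equal("\\frac{1}{2}", "\\frac{ 1 }{ 2 }") -> True
```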

Example

The experiment below was conducted by training on GSM8K, with all other settings consistent with the Issue #119 description. The two blue lines show the results without default_eval_type, which score lower. The red line, which uses the new math_eval_workflow, achieves scores consistent with the official benchmarks and better reflects the model's true performance.

(Figure: eval-workflow-res — evaluation score curves for the three runs described above)

TODO List

  • Enhance eval_workflow Customization:
    • Allow custom evaluation prompts to align with rollout formats or support other evaluation languages.
    • Enable customization of generation configs, which are currently set to recommended official values.
    • Design and implement Pass@k evaluation, a metric of growing interest in RFT (a reference estimator is sketched after this list).
    • Add an option to limit the evaluation set size for faster feedback.
  • Improve Workflow Integration:
    • Consider whether default_eval_type should be a required field.
    • Investigate how to retain additional evaluation metrics, such as format scores.
    • Extend support to workflows beyond mathematics.
  • eval_utils Compatibility:
    • Explore replacing the original answer comparison logic in eval_utils with the more robust implementation from math_eval_utils.
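
For reference on the Pass@k TODO above (not part of this PR), the standard unbiased estimator from the Codex paper (Chen et al., 2021) is straightforward to compute once n completions per problem are sampled and c of them are correct:

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k).

    n: number of sampled completions per problem, c: number of correct ones.
    """
    if n - c < k:  # fewer than k incorrect samples: any k-subset contains a correct one
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 16 samples, 3 correct, k = 4  ->  pass_at_k(16, 3, 4) ≈ 0.607
```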

Checklist

Please check the following items before the code is ready to be reviewed.

  • Code has passed all tests
  • Docstrings have been added/updated in Google Style
  • Documentation has been updated
  • Code is ready for review

@lingzhq lingzhq requested review from pan-x-c and yxdyc July 24, 2025 10:13
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @lingzhq, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant enhancement to the evaluation framework by adding an optional, decoupled evaluation workflow. This allows for more objective and specialized model assessment, particularly for mathematical tasks, by separating the evaluation process from the standard rollout workflow. The changes include a new dedicated math evaluation workflow and a refactored, centralized set of utilities for robust answer parsing and comparison.

Highlights

  • Decoupled Evaluation Workflow: Implemented an optional default_eval_type configuration field. When specified, the system will use a dedicated evaluation workflow instead of the default rollout workflow, enabling more precise and objective model assessment.
  • New MathEvalWorkflow: Added a specialized MathEvalWorkflow that adheres to the evaluation methodology of Qwen2.5-Math. This workflow is designed to accurately evaluate model performance on standard math benchmarks using a qwen_boxed format.
  • Consolidated Math Evaluation Utilities: Introduced math_eval_utils.py, a new module containing robust functions for answer extraction, parsing, and comparison, adapted from the Qwen2.5-Math evaluation script. This centralizes and enhances the mathematical evaluation logic, moving some functions previously in eval_utils.py.
  • Configuration and Integration: Updated various configuration files and the FileReader to support the new default_eval_type and integrate the MathEvalWorkflow into the existing system, ensuring it can be dynamically selected for evaluation tasks.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a decoupled evaluation workflow, enhancing the objectivity of model assessment. The new MathEvalWorkflow and math_eval_utils.py are excellent additions. The feedback focuses on minor refinements to enhance robustness, code clarity, and developer experience.

@lingzhq lingzhq force-pushed the dev/eval_workflow branch from 31d834e to e12f4b7 on July 29, 2025 at 13:35
@yxdyc yxdyc requested a review from chenyushuo July 30, 2025 06:43
@pan-x-c
Collaborator

pan-x-c commented Aug 1, 2025

/unittest-module-explorer

@github-actions

github-actions bot commented Aug 1, 2025

Summary

Tests 📝: 19 · Passed ✅: 19 · Failed ❌: 0 · Skipped ⏭️: 0 · Other ❓: 0 · Flaky 🍂: 0 · Duration ⏱️: 329ms

Tests

Test Name · Duration (all 19 passed, none flaky)
tests/explorer/explorer_test.py::BaseExplorerCase::test_explorer 1ms
tests/explorer/explorer_test.py::TestExplorerCountdownEval::test_explorer 94ms
tests/explorer/explorer_test.py::TestExplorerCountdownNoEval::test_explorer 96ms
tests/explorer/explorer_test.py::TestExplorerWithAddStrategy::test_explorer 52ms
tests/explorer/scheduler_test.py::SchedulerTest::test_concurrent_operations 4ms
tests/explorer/scheduler_test.py::SchedulerTest::test_get_results 20ms
tests/explorer/scheduler_test.py::SchedulerTest::test_scheduler_all_methods 14ms
tests/explorer/scheduler_test.py::SchedulerTest::test_scheduler_restart_after_stop 8ms
tests/explorer/scheduler_test.py::SchedulerTest::test_split_tasks 8ms
tests/explorer/scheduler_test.py::SchedulerTest::test_wait_all 8ms
tests/explorer/scheduler_test.py::SchedulerTest::test_wait_all_timeout_with_multi_batch 13ms
tests/explorer/workflow_test.py::WorkflowTest::test_gsm8k_workflow 1ms
tests/explorer/workflow_test.py::WorkflowTest::test_math_boxed_workflow 1ms
tests/explorer/workflow_test.py::WorkflowTest::test_math_complex_workflow 1ms
tests/explorer/workflow_test.py::WorkflowTest::test_math_eval_workflow 1ms
tests/explorer/workflow_test.py::WorkflowTest::test_math_fraction_workflow 1ms
tests/explorer/workflow_test.py::WorkflowTest::test_math_workflow 1ms
tests/explorer/workflow_test.py::WorkflowTest::test_rm_gallery_workflow 1ms
tests/explorer/workflow_test.py::WorkflowTest::test_workflow_resettable 1ms

Github Test Reporter by CTRF 💚

@lingzhq lingzhq force-pushed the dev/eval_workflow branch from e12f4b7 to 67f8cd1 on August 1, 2025 at 07:18
@hiyuchang
Collaborator

/unittest-module-explorer

1 similar comment
@chenyushuo
Collaborator

/unittest-module-explorer

@github-actions

github-actions bot commented Aug 1, 2025

Summary

Tests 📝: 20 · Passed ✅: 20 · Failed ❌: 0 · Skipped ⏭️: 0 · Other ❓: 0 · Flaky 🍂: 0 · Duration ⏱️: 347ms

Tests

Test Name · Duration (all 20 passed, none flaky)
tests/explorer/explorer_test.py::BaseExplorerCase::test_explorer 1ms
tests/explorer/explorer_test.py::TestExplorerCountdownEval::test_explorer 105ms
tests/explorer/explorer_test.py::TestExplorerCountdownNoEval::test_explorer 96ms
tests/explorer/explorer_test.py::TestExplorerWithAddStrategy::test_explorer 52ms
tests/explorer/scheduler_test.py::SchedulerTest::test_concurrent_operations 4ms
tests/explorer/scheduler_test.py::SchedulerTest::test_get_results 20ms
tests/explorer/scheduler_test.py::SchedulerTest::test_multi_step_execution 4ms
tests/explorer/scheduler_test.py::SchedulerTest::test_scheduler_all_methods 14ms
tests/explorer/scheduler_test.py::SchedulerTest::test_scheduler_restart_after_stop 8ms
tests/explorer/scheduler_test.py::SchedulerTest::test_split_tasks 8ms
tests/explorer/scheduler_test.py::SchedulerTest::test_wait_all 7ms
tests/explorer/scheduler_test.py::SchedulerTest::test_wait_all_timeout_with_multi_batch 13ms
tests/explorer/workflow_test.py::WorkflowTest::test_gsm8k_workflow 1ms
tests/explorer/workflow_test.py::WorkflowTest::test_math_boxed_workflow 1ms
tests/explorer/workflow_test.py::WorkflowTest::test_math_complex_workflow 1ms
tests/explorer/workflow_test.py::WorkflowTest::test_math_eval_workflow 1ms
tests/explorer/workflow_test.py::WorkflowTest::test_math_fraction_workflow 1ms
tests/explorer/workflow_test.py::WorkflowTest::test_math_workflow 1ms
tests/explorer/workflow_test.py::WorkflowTest::test_rm_gallery_workflow 1ms
tests/explorer/workflow_test.py::WorkflowTest::test_workflow_resettable 1ms

Github Test Reporter by CTRF 💚

Collaborator

@hiyuchang hiyuchang left a comment


lgtm

@hiyuchang hiyuchang merged commit 9fc6e47 into agentscope-ai:main Aug 1, 2025
1 check passed
@vadimkantorov
Contributor

  • Design and implement Pass@k evaluation, a metric of growing interest in RFT.

How would one implement Pass@k in the current Trinity design?

Am I understanding right that the eval workflow's run() still returns a list of multiple responses for a given input prompt?

For implementing Pass@k aggregation, I'd say that it should return a single response for a given input prompt? (and maybe still preserve the K responses inside of it)

@pan-x-c
Collaborator

pan-x-c commented Sep 11, 2025

Implement a new workflow that inherits from the original workflow; its run method executes the parent class's run method k times and collects the return values.

@vadimkantorov
Contributor

vadimkantorov commented Sep 11, 2025

So in essence - it would return a list with a single response object, right? (matching a single prompt as input to run())

@pan-x-c
Collaborator

pan-x-c commented Sep 11, 2025

Yes, the newly implemented Workflow needs to aggregate the Experiences generated by the k calls and return only one Experience.
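
As a rough sketch of that shape (the Experience and workflow classes below are stand-ins, not the actual Trinity-RFT API), a Pass@k wrapper workflow might look like this:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Experience:  # stand-in for Trinity's Experience type
    response: str
    correct: bool
    metrics: dict = field(default_factory=dict)


class MathEvalWorkflow:  # stand-in for the eval workflow added in this PR
    def run(self) -> List[Experience]:
        # the real implementation performs one generation plus qwen_boxed scoring
        return [Experience(response="\\boxed{42}", correct=True)]


class PassAtKEvalWorkflow(MathEvalWorkflow):
    """Calls the parent run() k times and folds the results into one Experience."""

    def __init__(self, k: int = 8):
        self.k = k

    def run(self) -> List[Experience]:
        samples: List[Experience] = []
        for _ in range(self.k):
            samples.extend(super().run())  # k independent generations
        solved = any(exp.correct for exp in samples)  # pass@k with exactly k samples
        merged = Experience(
            response=samples[0].response,
            correct=solved,
            metrics={"pass@k": float(solved), "k": self.k},
        )
        return [merged]  # a single aggregated Experience, as described above
```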

@vadimkantorov
Contributor

Does it make sense to extend the InferenceModel interface to allow passing a batch of prompts to chat()?

@pan-x-c
Collaborator

pan-x-c commented Sep 11, 2025

The chat interface is used to simulate openai.chat.completions, which only allows one message list. If you want to pass a batch of prompts, you can consider manually applying the chat_template to the message list and then calling the generate method.
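
For illustration, here is what that could look like with the chat template applied by hand and a single batched generate call. vLLM is used directly below for concreteness; Trinity's own generate signature is not shown in this thread, so treat the call shape as an assumption and check it against the docs.

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model_name = "Qwen/Qwen2.5-Math-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name)

questions = ["1 + 1 = ?", "What is 15% of 240?", "Solve x^2 = 9."]

# Apply the chat template manually to each message list ...
prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": q}],
        tokenize=False,
        add_generation_prompt=True,
    )
    for q in questions
]

# ... then send the whole batch in one generate call instead of several chat() calls.
outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=512))
for out in outputs:
    print(out.outputs[0].text)
```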

@vadimkantorov
Contributor

Hmm, for multi-turn workflows that process several initial rollouts, I'm guessing that being able to send a batch of different prompts to vllm could sometimes achieve better utilization.

@pan-x-c
Collaborator

pan-x-c commented Sep 12, 2025

Batching can indeed improve performance, but it reduces the flexibility of workflow writing, which is a trade-off. For scenarios that are more concerned with performance, we recommend using the generate interface, which supports batched prompts and achieves better performance.
