Fix repeat times in evaluation #410

chenyushuo · 2025-11-26T03:35:23Z

Description

As the title says.

Checklist

Please check the following items before code is ready to be reviewed.

Code has passed all tests
Docstrings have been added/updated in Google Style
Documentation has been updated
Code is ready for review

gemini-code-assist · 2025-11-26T03:35:44Z

Summary of Changes

Hello @chenyushuo, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the evaluation capabilities by introducing 'evaluation at k' metrics. This allows for a more detailed understanding of model performance across different prediction horizons or sample sizes. The changes involve updating core configuration structures, refining metric calculation logic, and ensuring these new metrics are correctly processed and logged, ultimately providing richer insights into model behavior during evaluation.

Highlights

Evaluation Metric Enhancement: Introduced support for 'evaluation at k' (eval_at_k) metrics, allowing for more granular performance analysis by calculating mean, best, and worst scores at specified 'k' values during evaluation.
Configuration Updates: Added eval_at_k fields to StorageConfig and TasksetConfig to configure the 'evaluation at k' behavior. The repeat_times and n parameters are now automatically derived from eval_at_k or algorithm.repeat_times.
Metric Aggregation Refinement: Refactored calculate_task_level_metrics and gather_metrics functions to dynamically compute and report the new mean@k, best@k, and worst@k metrics, ensuring proper logging in TensorBoard.
Typo Correction: Fixed a minor typo in mix_sample_strategy.py, changing 'usal_buffer' to 'usual_buffer' for consistency in state dictionary handling.
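
To make the metric names in the highlights concrete, here is a minimal sketch of how mean@k, best@k, and worst@k could be computed for a single task. It is not the Trinity-RFT implementation: the function name `metrics_at_k` is hypothetical, and using the first k scores of each task is an assumption, since the summary does not say how the k samples are selected. Note also that a later commit in this PR is titled "remove eval_at_k", so the merged code may differ.

```python
from statistics import mean
from typing import Dict, Sequence


def metrics_at_k(scores: Sequence[float], eval_at_k: Sequence[int]) -> Dict[str, float]:
    """Compute mean@k, best@k and worst@k for one task (hypothetical helper).

    Assumption: each metric is taken over the first k scores of the task.
    """
    metrics: Dict[str, float] = {}
    for k in eval_at_k:
        if k < 1 or k > len(scores):
            raise ValueError(f"k={k} is outside the valid range 1..{len(scores)}")
        head = list(scores[:k])
        metrics[f"mean@{k}"] = mean(head)
        metrics[f"best@{k}"] = max(head)
        metrics[f"worst@{k}"] = min(head)
    return metrics


# Example: four repeated rollouts of one evaluation task.
result = metrics_at_k([1.0, 0.0, 0.5, 1.0], eval_at_k=[2, 4])
```

With these four scores, `result["mean@2"]` is 0.5 and `result["mean@4"]` is 0.625, while best and worst are 1.0 and 0.0 at both values of k.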

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces a flexible evaluation mechanism using the eval_at_k configuration, allowing for more detailed metrics like mean@k, best@k, and worst@k. The changes are well-integrated across configuration, data processing, and metric calculation components. However, I've identified a critical issue in the calculation of best@k and worst@k metrics, and a potential crash scenario if eval_at_k is misconfigured as an empty list. Addressing these issues will ensure the new feature is robust and correct.
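
The empty-list scenario mentioned in the review can be illustrated with a small validation sketch. The helper `resolve_repeat_times` and its parameters are hypothetical, and taking the largest k as the repeat count is an assumption about how repeat_times would be derived from eval_at_k; the actual logic in trinity/common/config.py is not shown on this page.

```python
from typing import Optional, Sequence


def resolve_repeat_times(
    eval_at_k: Optional[Sequence[int]], algorithm_repeat_times: int
) -> int:
    """Return how many rollouts per task an evaluation run needs.

    Hypothetical helper, not the Trinity-RFT implementation.
    """
    if eval_at_k is None:
        # Nothing configured: fall back to the algorithm-level setting.
        return algorithm_repeat_times
    if len(eval_at_k) == 0:
        # Calling max() on an empty sequence would raise an opaque error,
        # so fail early with a clear message instead.
        raise ValueError("eval_at_k must not be empty")
    if any(k < 1 for k in eval_at_k):
        raise ValueError("every k in eval_at_k must be >= 1")
    # Assumption: the largest k determines how many samples are generated.
    return max(eval_at_k)
```

Rejecting the empty list at configuration time keeps the failure close to its cause rather than surfacing later during metric aggregation.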

trinity/explorer/scheduler.py

trinity/common/config.py

chenyushuo · 2025-11-26T03:39:32Z

/unittest-all

pan-x-c · 2025-11-26T04:36:23Z

/unittest-all

trinity/common/config.py

github-actions · 2025-11-26T06:00:18Z

Summary

Tests 📝	Passed ✅	Failed ❌	Skipped ⏭️	Other ❓	Flaky 🍂	Duration ⏱️
193	189	1	3	0	0	1h 21m

Failed Tests

Failed Tests ❌	Fail Message
❌ tests/explorer/explorer_test.py::TestExplorerCountdownEval::test_explorer	The test failed in the call phase due to an assertion error

Skipped

Tests	Status
tests/explorer/workflow_test.py::TestAgentScopeWorkflowAdapter::test_adapter	skipped ⏭️
tests/trainer/trainer_test.py::TestMultiModalGRPO::test_trainer	skipped ⏭️
tests/trainer/trainer_test.py::TestMultiModalSFT::test_trainer	skipped ⏭️

Tests

Test Name	Status	Duration
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_batch_level_std_grpo	✅	41ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_batch_level_step_wise_grpo_advantage	✅	2ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_duplicate_grpo	✅	5ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_grpo_advantage	✅	3ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_grpo_correct_bias	✅	2ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_grpo_reward_std	✅	1ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_step_wise_grpo_advantage	✅	2ms
tests/algorithm/advantage_fn_test.py::TestGroupedAdvantageFn::test_step_wise_grpo_with_std_threshold	✅	2ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_dpo_policy_loss	✅	2ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_gspo_policy_loss	✅	1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_mix_policy_loss	✅	1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_opmd_policy_loss	✅	1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_ppo_policy_loss	✅	1ms
tests/algorithm/policy_loss_test.py::VerlPolicyLossTest::test_sft_policy_loss	✅	1ms
tests/buffer/experience_pipeline_test.py::TestExperiencePipeline::test_experience_pipeline	✅	20.4s
tests/buffer/experience_pipeline_test.py::TestExperiencePipeline::test_pass_rate_calculation	✅	21.6s
tests/buffer/experience_storage_test.py::ExperienceStorageTest::test_sql_experience_buffer	✅	4.1s
tests/buffer/experience_storage_test.py::ExperienceStorageTest::test_sql_storage_0_sft	✅	6.1s
tests/buffer/experience_storage_test.py::ExperienceStorageTest::test_sql_storage_1_dpo	✅	6.7s
tests/buffer/file_test.py::TestFileBuffer::test_file_reader	✅	157ms
tests/buffer/file_test.py::TestFileBuffer::test_file_writer	✅	4.4s
tests/buffer/formatter_test.py::TestFormatter::test_dpo_messages_formatter	✅	539ms
tests/buffer/formatter_test.py::TestFormatter::test_dpo_plaintext_formatter	✅	466ms
tests/buffer/formatter_test.py::TestFormatter::test_multi_modal_sft_formatter	✅	888ms
tests/buffer/formatter_test.py::TestFormatter::test_sft_messages_formatter	✅	993ms
tests/buffer/formatter_test.py::TestFormatter::test_sft_plaintext_formatter	✅	720ms
tests/buffer/formatter_test.py::TestFormatter::test_task_formatter	✅	222ms
tests/buffer/queue_test.py::TestQueueBuffer::test_priority_queue_buffer_reuse	✅	9.2s
tests/buffer/queue_test.py::TestQueueBuffer::test_priority_queue_capacity	✅	5.2s
tests/buffer/queue_test.py::TestQueueBuffer::test_priority_queue_reuse_count_control	✅	6.8s
tests/buffer/queue_test.py::TestQueueBuffer::test_queue_buffer_0_queue	✅	6.1s
tests/buffer/queue_test.py::TestQueueBuffer::test_queue_buffer_1_priority_queue	✅	6.0s
tests/buffer/queue_test.py::TestQueueBuffer::test_queue_buffer_capacity	✅	6.3s
tests/buffer/reader_test.py::TestBufferReader::test_buffer_reader_registration	✅	614ms
tests/buffer/reward_shaping_mapper_test.py::TestRewardShapingMapper::test_basic_usage	✅	6ms
tests/buffer/sql_test.py::TestSQLBuffer::test_sql_exp_buffer_read_write	✅	4.0s
tests/buffer/sql_test.py::TestSQLBuffer::test_sql_task_buffer_read_write	✅	4.5s
tests/buffer/task_scheduler_test.py::TestTaskScheduler::test_task_scheduler_0	✅	91ms
tests/buffer/task_scheduler_test.py::TestTaskScheduler::test_task_scheduler_1	✅	71ms
tests/buffer/task_scheduler_test.py::TestTaskScheduler::test_task_scheduler_2	✅	110ms
tests/buffer/task_scheduler_test.py::TestTaskScheduler::test_task_scheduler_3	✅	111ms
tests/buffer/task_scheduler_test.py::TestTaskScheduler::test_task_scheduler_4	✅	112ms
tests/buffer/task_scheduler_test.py::TestTaskScheduler::test_task_scheduler_5	✅	116ms
tests/buffer/task_scheduler_test.py::TestTaskScheduler::test_task_scheduler_6	✅	131ms
tests/buffer/task_scheduler_test.py::TestTaskScheduler::test_task_scheduler_simple	✅	59ms
tests/buffer/task_storage_test.py::TaskStorageTest::test_read_task_0_file	✅	71ms
tests/buffer/task_storage_test.py::TaskStorageTest::test_read_task_1_sql	✅	4.0s
tests/buffer/task_storage_test.py::TaskStorageTest::test_read_task_2_file	✅	52ms
tests/buffer/task_storage_test.py::TaskStorageTest::test_read_task_3_sql	✅	4.1s
tests/buffer/task_storage_test.py::TaskStorageTest::test_read_task_4_file	✅	52ms
tests/buffer/task_storage_test.py::TaskStorageTest::test_read_task_5_sql	✅	4.5s
tests/cli/launcher_test.py::TestLauncherMain::test_debug_mode	✅	48.0s
tests/cli/launcher_test.py::TestLauncherMain::test_main_run_command	✅	6.9s
tests/cli/launcher_test.py::TestLauncherMain::test_main_run_in_dlc	✅	1.8s
tests/cli/launcher_test.py::TestLauncherMain::test_main_studio_command	✅	337ms
tests/cli/launcher_test.py::TestLauncherMain::test_multi_stage_run	✅	1.7s
tests/common/config_test.py::TestConfig::test_all_examples_are_valid	✅	34.3s
tests/common/config_test.py::TestConfig::test_chat_template_path	✅	95ms
tests/common/config_test.py::TestConfig::test_config_flatten	✅	41ms
tests/common/config_test.py::TestConfig::test_continue_from_checkpoint_is_valid	✅	195ms
tests/common/config_test.py::TestConfig::test_default_workflow	✅	93ms
tests/common/config_test.py::TestConfig::test_load_default_config	✅	4.9s
tests/common/config_test.py::TestConfig::test_max_token_len_per_gpu_set_correctly	✅	96ms
tests/common/config_test.py::TestConfig::test_optimizer_config_propagation	✅	93ms
tests/common/config_test.py::TestConfig::test_update_config_from_ray_cluster	✅	161ms
tests/common/experience_test.py::TestEID::test_eid_properties	✅	1ms
tests/common/experience_test.py::TestExperience::test_action_mask_and_logprobs_type	✅	1ms
tests/common/experience_test.py::TestExperience::test_assertions	✅	1ms
tests/common/experience_test.py::TestExperience::test_dpo_experience	✅	1ms
tests/common/experience_test.py::TestExperience::test_gather	✅	1ms
tests/common/experience_test.py::TestExperience::test_gather_with_token_level_reward	✅	1ms
tests/common/experience_test.py::TestExperience::test_hf_datasets_conversion	✅	15ms
tests/common/experience_test.py::TestExperience::test_multi_turn_experience	✅	1ms
tests/common/experience_test.py::TestExperience::test_serialize_deserialize	✅	1ms
tests/common/experience_test.py::TestExperience::test_single_turn_experience	✅	1ms
tests/common/experience_test.py::TestExperience::test_to_dict	✅	1ms
tests/common/experience_test.py::TestExperienceConversion::test_batch_conversion	✅	1ms
tests/common/experience_test.py::TestExperienceConversion::test_dpo_experience_batch_conversion	✅	1ms
tests/common/experience_test.py::TestExperienceConversion::test_experience_model_experience_conversion	✅	1ms
tests/common/experience_test.py::TestExperienceConversion::test_gather_experiences_with_custom_fields	✅	1ms
tests/common/experience_test.py::TestExperienceConversion::test_multiturn_experience_batch_converstion	✅	1ms
tests/common/vllm_test.py::ModelWrapperTest_0::test_generate	✅	45.2s
tests/common/vllm_test.py::ModelWrapperTest_1::test_generate	✅	31.2s
tests/common/vllm_test.py::ModelWrapperTest_2::test_generate	✅	42.1s
tests/common/vllm_test.py::TestModelLen_0::test_model_len	✅	17.2s
tests/common/vllm_test.py::TestModelLen_1::test_model_len	✅	17.3s
tests/common/vllm_test.py::TestModelLenWithoutPromptTruncation::test_model_len	✅	16.8s
tests/common/vllm_test.py::TestAPIServer::test_api	✅	22.5s
tests/common/vllm_test.py::TestLogprobs::test_logprobs	✅	18.8s
tests/common/vllm_test.py::TestAsyncAPIServer::test_api_async	✅	23.2s
tests/common/vllm_test.py::TestTokenizer::test_action_mask	✅	248ms
tests/common/vllm_test.py::TestTokenizer::test_action_mask_with_tools	✅	232ms
tests/common/vllm_test.py::TestAPIServerToolCall_0_deepseek_r1::test_api_tool_calls	✅	19.7s
tests/common/vllm_test.py::TestAPIServerToolCall_1::test_api_tool_calls	✅	17.3s
tests/common/vllm_test.py::TestSuperLongGeneration::test_generate	✅	2m 9s
tests/explorer/explorer_test.py::TestExplorerCountdownEval::test_explorer	❌	1m 16s
tests/explorer/explorer_test.py::TestExplorerGSM8KRULERNoEval::test_explorer	✅	1m 45s
tests/explorer/explorer_test.py::TestExplorerGSM8k::test_explorer	✅	3m 38s
tests/explorer/explorer_test.py::ServeTest::test_serve	✅	1m 20s
tests/explorer/scheduler_test.py::SchedulerTest::test_async_workflow	✅	12.8s
tests/explorer/scheduler_test.py::SchedulerTest::test_concurrent_operations	✅	12.0s
tests/explorer/scheduler_test.py::SchedulerTest::test_dynamic_timeout	✅	20.2s
tests/explorer/scheduler_test.py::SchedulerTest::test_get_results	✅	28.3s
tests/explorer/scheduler_test.py::SchedulerTest::test_metric_calculation_with_non_repeatable_workflow_0	✅	12.3s
tests/explorer/scheduler_test.py::SchedulerTest::test_metric_calculation_with_non_repeatable_workflow_1	✅	12.4s
tests/explorer/scheduler_test.py::SchedulerTest::test_metric_calculation_with_repeatable_workflow_0	✅	12.2s
tests/explorer/scheduler_test.py::SchedulerTest::test_metric_calculation_with_repeatable_workflow_1	✅	12.5s
tests/explorer/scheduler_test.py::SchedulerTest::test_multi_step_execution	✅	12.9s
tests/explorer/scheduler_test.py::SchedulerTest::test_non_repeatable_workflow	✅	12.8s
tests/explorer/scheduler_test.py::SchedulerTest::test_over_rollout_min_wait	✅	16.3s
tests/explorer/scheduler_test.py::SchedulerTest::test_scheduler_all_methods	✅	22.1s
tests/explorer/scheduler_test.py::SchedulerTest::test_scheduler_restart_after_stop	✅	23.9s
tests/explorer/scheduler_test.py::SchedulerTest::test_split_tasks	✅	15.6s
tests/explorer/scheduler_test.py::SchedulerTest::test_stepwise_experience_eid	✅	32.5s
tests/explorer/scheduler_test.py::SchedulerTest::test_wait_all	✅	15.3s
tests/explorer/scheduler_test.py::SchedulerTest::test_wait_all_timeout_with_multi_batch	✅	21.0s
tests/explorer/scheduler_test.py::TestRunnerStateCollection::test_runner_state_collection	✅	17.4s
tests/explorer/step_wise_workflow_test.py::WorkflowTest::test_reward_propagation_workflow_0	✅	2ms
tests/explorer/step_wise_workflow_test.py::WorkflowTest::test_reward_propagation_workflow_1	✅	602ms
tests/explorer/step_wise_workflow_test.py::WorkflowTest::test_step_wise_reward_workflow_0	✅	1ms
tests/explorer/step_wise_workflow_test.py::WorkflowTest::test_step_wise_reward_workflow_1	✅	1.0s
tests/explorer/step_wise_workflow_test.py::WorkflowTest::test_workflows_raise_error	✅	1ms
tests/explorer/step_wise_workflow_test.py::WorkflowTest::test_workflows_stop_at_max_env_steps	✅	1.0s
tests/explorer/workflow_test.py::WorkflowTest::test_gsm8k_workflow	✅	18ms
tests/explorer/workflow_test.py::WorkflowTest::test_math_boxed_workflow	✅	24ms
tests/explorer/workflow_test.py::WorkflowTest::test_math_complex_workflow	✅	267ms
tests/explorer/workflow_test.py::WorkflowTest::test_math_eval_workflow	✅	4ms
tests/explorer/workflow_test.py::WorkflowTest::test_math_fraction_workflow	✅	16ms
tests/explorer/workflow_test.py::WorkflowTest::test_math_workflow	✅	10ms
tests/explorer/workflow_test.py::WorkflowTest::test_rm_gallery_workflow	✅	113ms
tests/explorer/workflow_test.py::WorkflowTest::test_workflow_repeatable_0	✅	1ms
tests/explorer/workflow_test.py::WorkflowTest::test_workflow_repeatable_1	✅	101ms
tests/explorer/workflow_test.py::WorkflowTest::test_workflow_resettable_0	✅	1ms
tests/explorer/workflow_test.py::WorkflowTest::test_workflow_resettable_1	✅	201ms
tests/explorer/workflow_test.py::MultiTurnWorkflowTest_0::test_multi_turn_workflow	✅	14.5s
tests/explorer/workflow_test.py::MultiTurnWorkflowTest_1::test_multi_turn_workflow	✅	14.5s
tests/explorer/workflow_test.py::TestWorkflowStateRecording::test_workflow_state_recording	✅	4.0s
tests/explorer/workflow_test.py::TestAgentScopeWorkflowAdapter::test_adapter	⏭️	1ms
tests/explorer/workflow_test.py::TestWorkflowRunner::test_workflow_runner	✅	302ms
tests/explorer/workflow_test.py::TestWorkflowRunner::test_workflow_runner_get_state	✅	8.1s
tests/manager/synchronizer_test.py::TestSynchronizerExit::test_synchronizer	✅	1m 6s
tests/manager/synchronizer_test.py::TestStateDictBasedSynchronizer_0::test_synchronizer	✅	1m 51s
tests/manager/synchronizer_test.py::TestStateDictBasedSynchronizer_1::test_synchronizer	✅	1m 53s
tests/manager/synchronizer_test.py::TestStateDictBasedSynchronizer_2::test_synchronizer	✅	2m 33s
tests/manager/synchronizer_test.py::TestStateDictBasedSynchronizer_3::test_synchronizer	✅	2m 36s
tests/manager/synchronizer_test.py::TestNCCLBasedSynchronizer_0::test_synchronizer	✅	1m 50s
tests/manager/synchronizer_test.py::TestNCCLBasedSynchronizer_1::test_synchronizer	✅	1m 50s
tests/service/data_juicer_test.py::TestDataJuicer::test_config	✅	2.2s
tests/service/data_juicer_test.py::TestDataJuicer::test_server_start	✅	21.6s
tests/service/data_juicer_test.py::TestDataJuicerExperiencePipeline::test_data_juicer_operators	✅	31.7s
tests/service/data_juicer_test.py::TestDataJuicerTaskPipeline::test_data_juicer_task_pipeline	✅	14.3s
tests/trainer/trainer_test.py::TestTrainerCountdown_0_fsdp::test_trainer	✅	3m 7s
tests/trainer/trainer_test.py::TestTrainerCountdown_1_megatron::test_trainer	✅	4m 7s
tests/trainer/trainer_test.py::TestStepAheadAsyncRL::test_trainer	✅	1m 27s
tests/trainer/trainer_test.py::TestTrainerGSM8K_0_fsdp::test_trainer	✅	1m 27s
tests/trainer/trainer_test.py::TestTrainerGSM8K_1_fsdp2::test_trainer	✅	1m 25s
tests/trainer/trainer_test.py::TestTrainerGSM8K_2_fsdp::test_trainer	✅	1m 28s
tests/trainer/trainer_test.py::TestTrainerGSM8K_3_fsdp2::test_trainer	✅	1m 33s
tests/trainer/trainer_test.py::TestTrainerSFTWarmupGSM8K::test_trainer	✅	2m 36s
tests/trainer/trainer_test.py::TestTrainerDPO::test_trainer	✅	1m 5s
tests/trainer/trainer_test.py::TestTrainerSFT::test_trainer	✅	1m 2s
tests/trainer/trainer_test.py::TestTrainerToolsSFT::test_trainer_tools	✅	1m 3s
tests/trainer/trainer_test.py::TestFullyAsyncMode_0_fsdp::test_fully_async_mode	✅	1m 56s
tests/trainer/trainer_test.py::TestFullyAsyncMode_1_fsdp::test_fully_async_mode	✅	2m 1s
tests/trainer/trainer_test.py::TestFullyAsyncMode_2_megatron::test_fully_async_mode	✅	2m 27s
tests/trainer/trainer_test.py::TestTrainerCheckpointSave_0_fsdp::test_trainer	✅	2m 20s
tests/trainer/trainer_test.py::TestTrainerCheckpointSave_1_megatron::test_trainer	✅	4m 4s
tests/trainer/trainer_test.py::TestTrainerMIX::test_trainer	✅	2m 33s
tests/trainer/trainer_test.py::TestMultiModalGRPO::test_trainer	⏭️	810ms
tests/trainer/trainer_test.py::TestMultiModalSFT::test_trainer	⏭️	807ms
tests/trainer/trainer_test.py::TestTrainerLoRA::test_trainer	✅	3m 39s
tests/trainer/trainer_test.py::TestOverRollout::test_trainer	✅	1m 21s
tests/utils/eval_utils_test.py::TestComputeScore::test_both_boxed_and_equivalent	✅	15ms
tests/utils/eval_utils_test.py::TestComputeScore::test_both_boxed_and_not_equivalent	✅	1ms
tests/utils/eval_utils_test.py::TestComputeScore::test_empty_ground_truth	✅	2ms
tests/utils/eval_utils_test.py::TestComputeScore::test_empty_solution_string	✅	1ms
tests/utils/eval_utils_test.py::TestComputeScore::test_multiple_boxed_answers_in_solution	✅	2ms
tests/utils/eval_utils_test.py::TestComputeScore::test_solution_boxed_truth_raw_and_equivalent	✅	1ms
tests/utils/eval_utils_test.py::TestComputeScore::test_solution_boxed_truth_raw_and_not_equivalent	✅	1ms
tests/utils/eval_utils_test.py::TestComputeScore::test_solution_not_boxed	✅	1ms
tests/utils/eval_utils_test.py::TestComputeScore::test_solution_raw_and_ground_truth_boxed_equivalent	✅	1ms
tests/utils/eval_utils_test.py::TestMathEvalUtils::test_extract_answer	✅	4ms
tests/utils/eval_utils_test.py::TestMathEvalUtils::test_verify_math_answer	✅	74ms
tests/utils/eval_utils_test.py::TestEvalUtils::test_is_equiv	✅	6ms
tests/utils/log_test.py::LogTest::test_actor_log	✅	5.1s
tests/utils/log_test.py::LogTest::test_group_by_node	✅	4.9s
tests/utils/log_test.py::LogTest::test_no_actor_log	✅	909ms
tests/utils/plugin_test.py::TestPluginLoader::test_load_plugins_local_0__workspace_tests_utils_plugins	✅	98ms
tests/utils/plugin_test.py::TestPluginLoader::test_load_plugins_local_1_tests_utils_plugins	✅	95ms
tests/utils/plugin_test.py::TestPluginLoader::test_load_plugins_remote_0__workspace_tests_utils_plugins	✅	22.1s
tests/utils/plugin_test.py::TestPluginLoader::test_load_plugins_remote_1_tests_utils_plugins	✅	22.1s
tests/utils/plugin_test.py::TestPluginLoader::test_passing_custom_class_0__workspace_tests_utils_plugins	✅	12.0s
tests/utils/plugin_test.py::TestPluginLoader::test_passing_custom_class_1_tests_utils_plugins	✅	11.8s

Github Test Reporter by CTRF 💚

chenyushuo · 2025-11-26T06:46:53Z

/unittest-module-explorer

github-actions · 2025-11-26T07:03:09Z

Summary

Tests 📝	Passed ✅	Failed ❌	Skipped ⏭️	Other ❓	Flaky 🍂	Duration ⏱️
45	44	0	1	0	0	14m 30s

Skipped

Tests	Status
tests/explorer/workflow_test.py::TestAgentScopeWorkflowAdapter::test_adapter	skipped ⏭️

Tests

Test Name	Status	Duration
tests/explorer/explorer_test.py::TestExplorerCountdownEval::test_explorer	✅	1m 28s
tests/explorer/explorer_test.py::TestExplorerGSM8KRULERNoEval::test_explorer	✅	1m 44s
tests/explorer/explorer_test.py::TestExplorerGSM8k::test_explorer	✅	3m 37s
tests/explorer/explorer_test.py::ServeTest::test_serve	✅	1m 23s
tests/explorer/scheduler_test.py::SchedulerTest::test_async_workflow	✅	12.5s
tests/explorer/scheduler_test.py::SchedulerTest::test_concurrent_operations	✅	12.3s
tests/explorer/scheduler_test.py::SchedulerTest::test_dynamic_timeout	✅	20.4s
tests/explorer/scheduler_test.py::SchedulerTest::test_get_results	✅	27.6s
tests/explorer/scheduler_test.py::SchedulerTest::test_metric_calculation_with_non_repeatable_workflow_0	✅	12.7s
tests/explorer/scheduler_test.py::SchedulerTest::test_metric_calculation_with_non_repeatable_workflow_1	✅	12.6s
tests/explorer/scheduler_test.py::SchedulerTest::test_metric_calculation_with_repeatable_workflow_0	✅	12.3s
tests/explorer/scheduler_test.py::SchedulerTest::test_metric_calculation_with_repeatable_workflow_1	✅	12.1s
tests/explorer/scheduler_test.py::SchedulerTest::test_multi_step_execution	✅	12.7s
tests/explorer/scheduler_test.py::SchedulerTest::test_non_repeatable_workflow	✅	12.3s
tests/explorer/scheduler_test.py::SchedulerTest::test_over_rollout_min_wait	✅	16.3s
tests/explorer/scheduler_test.py::SchedulerTest::test_scheduler_all_methods	✅	22.3s
tests/explorer/scheduler_test.py::SchedulerTest::test_scheduler_restart_after_stop	✅	23.6s
tests/explorer/scheduler_test.py::SchedulerTest::test_split_tasks	✅	15.5s
tests/explorer/scheduler_test.py::SchedulerTest::test_stepwise_experience_eid	✅	32.6s
tests/explorer/scheduler_test.py::SchedulerTest::test_wait_all	✅	15.5s
tests/explorer/scheduler_test.py::SchedulerTest::test_wait_all_timeout_with_multi_batch	✅	21.4s
tests/explorer/scheduler_test.py::TestRunnerStateCollection::test_runner_state_collection	✅	17.5s
tests/explorer/step_wise_workflow_test.py::WorkflowTest::test_reward_propagation_workflow_0	✅	1ms
tests/explorer/step_wise_workflow_test.py::WorkflowTest::test_reward_propagation_workflow_1	✅	602ms
tests/explorer/step_wise_workflow_test.py::WorkflowTest::test_step_wise_reward_workflow_0	✅	1ms
tests/explorer/step_wise_workflow_test.py::WorkflowTest::test_step_wise_reward_workflow_1	✅	1.0s
tests/explorer/step_wise_workflow_test.py::WorkflowTest::test_workflows_raise_error	✅	1ms
tests/explorer/step_wise_workflow_test.py::WorkflowTest::test_workflows_stop_at_max_env_steps	✅	1.0s
tests/explorer/workflow_test.py::WorkflowTest::test_gsm8k_workflow	✅	33ms
tests/explorer/workflow_test.py::WorkflowTest::test_math_boxed_workflow	✅	25ms
tests/explorer/workflow_test.py::WorkflowTest::test_math_complex_workflow	✅	173ms
tests/explorer/workflow_test.py::WorkflowTest::test_math_eval_workflow	✅	4ms
tests/explorer/workflow_test.py::WorkflowTest::test_math_fraction_workflow	✅	13ms
tests/explorer/workflow_test.py::WorkflowTest::test_math_workflow	✅	8ms
tests/explorer/workflow_test.py::WorkflowTest::test_rm_gallery_workflow	✅	131ms
tests/explorer/workflow_test.py::WorkflowTest::test_workflow_repeatable_0	✅	1ms
tests/explorer/workflow_test.py::WorkflowTest::test_workflow_repeatable_1	✅	101ms
tests/explorer/workflow_test.py::WorkflowTest::test_workflow_resettable_0	✅	1ms
tests/explorer/workflow_test.py::WorkflowTest::test_workflow_resettable_1	✅	201ms
tests/explorer/workflow_test.py::MultiTurnWorkflowTest_0::test_multi_turn_workflow	✅	14.9s
tests/explorer/workflow_test.py::MultiTurnWorkflowTest_1::test_multi_turn_workflow	✅	14.6s
tests/explorer/workflow_test.py::TestWorkflowStateRecording::test_workflow_state_recording	✅	4.0s
tests/explorer/workflow_test.py::TestAgentScopeWorkflowAdapter::test_adapter	⏭️	1ms
tests/explorer/workflow_test.py::TestWorkflowRunner::test_workflow_runner	✅	295ms
tests/explorer/workflow_test.py::TestWorkflowRunner::test_workflow_runner_get_state	✅	8.1s

Fix repeat times in evaluation

efd7bd2

Merge branch 'main' of github.com:modelscope/Trinity-RFT into fix/eva…

70ccf4d

…l_configs

gemini-code-assist bot reviewed Nov 26, 2025

trinity/explorer/scheduler.py (outdated, resolved)

trinity/common/config.py (outdated, resolved)

chenyushuo requested a review from pan-x-c November 26, 2025 03:36

apply suggestions from gemini

aaa6aae

chenyushuo added 2 commits November 26, 2025 12:23

remove eval_at_k

fa2dbf2

Merge branch 'main' of github.com:modelscope/Trinity-RFT into fix/eva…

3f86ba9

…l_configs

pan-x-c reviewed Nov 26, 2025

trinity/common/config.py (outdated, resolved)

trinity/common/config.py (outdated, resolved)

doc fix

97935f8

fix unittest and rename task_execution to run_execution in workfl…

f437b66

…ow_runner

pan-x-c approved these changes Nov 26, 2025

pan-x-c merged commit 89dd059 into modelscope:main Nov 26, 2025
1 check passed
