feat(evaluation): add majority and selector #10

Closed
ntudy wants to merge 6 commits into main from dev-vote-select

@ntudy ntudy commented Aug 14, 2025

Describe this PR

Checklist for PR

Must Do

  • Write a good PR title and description, e.g. feat(agent): add pdf tool via mcp, perf: make llm client async, or fix(utils): load custom config via importlib. The CI job check-pr-title enforces the Angular commit message format on PR titles.
  • Run make precommit locally. The CI job lint enforces ruff's default format/lint rules on all new code.
  • Run make pytest. Check the test summary (located at report.html) and the coverage report (located at htmlcov/index.html) for new code.

Nice To Have

  • (Optional) Write/update tests under /tests for feat and test PRs.
  • (Optional) Write/update docs under /docs for docs and ci PRs.

Copilot AI left a comment

Pull Request Overview

This PR adds evaluation functionality for analyzing multiple agent runs, including majority voting and LLM-based solution selection. The changes introduce two new analysis methods to help select the best solutions from multiple agent attempts.

  • Adds majority vote calculation for determining task success based on multiple runs
  • Implements an LLM-powered solution selector to choose the best answer from multiple attempts
  • Updates evaluation scripts to include the new analysis tools
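The majority-vote step in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code in calculate_score_from_log.py: it assumes the vote is taken over each run's final answer string, and the names `majority_vote` and `task_passes` and the normalization rule (trim and lowercase) are hypothetical.

```python
from collections import Counter


def majority_vote(answers):
    """Return (most common non-empty answer, vote count), or (None, 0).

    Hypothetical helper: answers are normalized by trimming and lowercasing.
    """
    normalized = [a.strip().lower() for a in answers if a and a.strip()]
    if not normalized:
        return None, 0
    # On a tie, Counter.most_common keeps first-seen order, so the earliest run wins.
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer, votes


def task_passes(answers, ground_truth):
    """A task counts as solved when the majority answer matches the reference."""
    answer, _ = majority_vote(answers)
    return answer is not None and answer == ground_truth.strip().lower()
```

For example, `task_passes(["42", "42", "41"], "42")` is true because two of the three runs agree on the reference answer. If the PR instead votes over per-run pass/fail judgments, the same counting applies to booleans rather than answer strings.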

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| apps/run-agent/scripts/claude-sonnet-3.7/run_evaluate_multiple_runs_gaia-validation.sh | Adds new evaluation steps for score calculation and solution selection |
| apps/run-agent/main.py | Registers the new llm-solution-selector command |
| apps/run-agent/llm_solution_selector.py | New module implementing LLM-based solution selection with async processing |
| apps/run-agent/calculate_score_from_log.py | Adds majority vote calculation and updates log file pattern matching |
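As a rough picture of what "LLM-based solution selection with async processing" could look like, the sketch below fans out one selection call per task with bounded concurrency. It is an assumption-laden illustration, not the contents of llm_solution_selector.py: `select_solution`, `select_all`, the `judge` callback, and `longest_answer_judge` (a stand-in for a real LLM call) are all hypothetical names, and each task is assumed to have at least one candidate.

```python
import asyncio


async def select_solution(task_id, candidates, judge, semaphore):
    """Ask `judge` for the index of the best candidate; fall back to the first one."""
    async with semaphore:  # cap the number of judge calls in flight
        choice = await judge(task_id, candidates)
    if isinstance(choice, int) and 0 <= choice < len(candidates):
        return task_id, candidates[choice]
    return task_id, candidates[0]  # invalid reply: keep the first attempt


async def select_all(tasks, judge, max_concurrency=4):
    """Run selection for every task concurrently and return {task_id: answer}."""
    semaphore = asyncio.Semaphore(max_concurrency)
    results = await asyncio.gather(
        *(select_solution(tid, cands, judge, semaphore) for tid, cands in tasks.items())
    )
    return dict(results)


async def longest_answer_judge(task_id, candidates):
    """Placeholder judge (assumption): a real one would prompt an LLM instead."""
    await asyncio.sleep(0)
    return max(range(len(candidates)), key=lambda i: len(candidates[i]))
```

A call such as `asyncio.run(select_all({"t1": ["a", "abc"]}, longest_answer_judge))` returns one chosen answer per task. Passing the judge in as a callback keeps the concurrency logic testable without network access.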

@ntudy ntudy closed this Aug 19, 2025
@ntudy ntudy deleted the dev-vote-select branch August 27, 2025 05:55
BinWang28 pushed a commit that referenced this pull request Mar 11, 2026
feat(tool-blacklist): add tool blacklist function