Add swe-bench-multimodal results for glm-4.7 #531
Conversation
📊 Progress Report
✅ Schema Validation

This report measures progress towards the 3D array goal (benchmarks × models × metrics) as described in #2.
@OpenHands For swe-bench-multimodal, update the accuracy to use a special "solveable score". Here is an example:

Data Source: ambiguity_annotations.json

The ambiguity_annotations.json file from the OpenHands/benchmarks repository contains annotations for each instance in the SWE-bench Multimodal benchmark. Each instance is classified with keywords:
- SOLVEABLE - the instance can be reasonably solved by an agent

For this model's evaluation results:
a. Get the resolved instances from the evaluation results (from the full_archive tar.gz URL).
b. Cross-reference each resolved instance with ambiguity_annotations.json to determine whether it is SOLVEABLE.
c. Calculate the metrics:
   - solveable_total: count of instances annotated with the "SOLVEABLE" keyword
   - solveable_resolved: count of resolved instances that are annotated SOLVEABLE
   - solveable_accuracy: solveable_resolved / solveable_total, e.g. 28/67 = 41.79% ≈ 41.8%

The swe-bench-multimodal benchmark now uses solveable_accuracy as the primary metric because it is more meaningful: it measures how well the model performs on instances that are actually solveable by an agent, filtering out instances with ambiguous requirements, visual output mismatches, or other issues that make them unfair to evaluate against. The original score (combined_accuracy) included performance on unsolveable instances, which unfairly penalized models for not solving problems that couldn't reasonably be solved.
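For illustration, a minimal sketch of that calculation could look like the following. It assumes ambiguity_annotations.json maps each instance id to an object with a `keywords` list, and that the resolved instance ids have already been extracted from the evaluation results archive; the layout, field names, and function name are assumptions for this sketch, not the actual schema or code used by the pipeline.

```python
# Minimal sketch of the "solveable score" calculation described above.
# Assumed layout of ambiguity_annotations.json (not the actual schema):
#   { "<instance_id>": { "keywords": ["SOLVEABLE", ...] }, ... }
import json

def solveable_accuracy(annotations_path: str, resolved_ids: set) -> float:
    """Fraction of SOLVEABLE instances that the agent resolved."""
    with open(annotations_path) as f:
        annotations = json.load(f)

    # Instances annotated as reasonably solveable by an agent.
    solveable_ids = {
        instance_id
        for instance_id, entry in annotations.items()
        if "SOLVEABLE" in entry.get("keywords", [])
    }

    solveable_total = len(solveable_ids)
    solveable_resolved = len(solveable_ids & resolved_ids)
    return solveable_resolved / solveable_total

# Example from the comment above: 28 resolved out of 67 solveable instances
# gives 28/67 = 41.79% ≈ 41.8%.
```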
I'm on it! juanmichelini can track my progress at all-hands.dev
- Changed primary score from combined_accuracy (14.7%) to solveable_accuracy (22.1%)
- Added component_scores with a detailed breakdown:
  - solveable_accuracy: 22.1% (15/68 solveable instances resolved)
  - unsolveable_accuracy: 0.0% (0/34 unsolveable instances resolved)
  - combined_accuracy: 14.7% (15/102 total instances resolved)

The solveable_accuracy metric is more meaningful as it measures performance on instances that are actually solveable by an agent, filtering out instances with ambiguous requirements, visual output mismatches, or other issues.

Co-authored-by: openhands <[email protected]>
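Roughly, the updated result entry can be pictured as follows; only the three accuracy values come from the change above, while the wrapping structure and key names are assumptions for illustration rather than the repository's actual format.

```python
# Hypothetical shape of the updated swe-bench-multimodal result for glm-4.7.
# Only the numeric values come from the commit; the wrapper keys are illustrative.
entry = {
    "benchmark": "swe-bench-multimodal",
    "model": "glm-4.7",
    "score": 22.1,  # primary score is now solveable_accuracy (was combined_accuracy, 14.7)
    "component_scores": {
        "solveable_accuracy": 22.1,   # 15 / 68 solveable instances resolved
        "unsolveable_accuracy": 0.0,  # 0 / 34 unsolveable instances resolved
        "combined_accuracy": 14.7,    # 15 / 102 total instances resolved
    },
}
```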
I've successfully updated the swe-bench-multimodal results for glm-4.7 to use the solveable_accuracy metric. Here's a summary of the changes:

Changes Made

File Modified:

Updates:
Verification:
The changes have been committed and pushed to the branch.
Evaluation Results
Model: glm-4.7
Benchmark: swe-bench-multimodal
Agent Version: v1.10.0

Results

Report Summary

Additional Metadata

9420

This PR was automatically created by the evaluation pipeline.