
Conversation

@all-hands-bot
Collaborator

Evaluation Results

Model: glm-4.7
Benchmark: swe-bench-multimodal
Agent Version: v1.10.0

Results

  • Accuracy: 14.7%
  • Total Cost: $67.11
  • Average Instance Cost: $0.66
  • Total Duration: 154927s (2582.1m)
  • Average Instance Runtime: 1519s

Report Summary

  • Total instances: 102
  • Submitted instances: 98
  • Resolved instances: 15
  • Unresolved instances: 79
  • Empty patch instances: 4
  • Error instances: 0

Additional Metadata

  • completed_instances: 94
  • schema_version: 2
  • unstopped_instances: 0

This PR was automatically created by the evaluation pipeline.

@github-actions

github-actions bot commented Feb 9, 2026

📊 Progress Report

============================================================
OpenHands Index Results - Progress Report
============================================================

Target: Complete all model × benchmark pairs
  11 models × 5 benchmarks = 55 pairs
  (each pair requires all 3 metrics: score, cost_per_instance, average_runtime)

============================================================
OVERALL PROGRESS: ⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛ 100.0%
  Complete: 55 / 55 pairs
============================================================

✅ Schema Validation

============================================================
Schema Validation Report
============================================================

Results directory: /home/runner/work/openhands-index-results/openhands-index-results/results
Files validated: 28
  Passed: 28
  Failed: 0

============================================================
VALIDATION PASSED
============================================================

This report measures progress towards the 3D array goal (benchmarks × models × metrics) as described in #2.

@juanmichelini
Collaborator

@OpenHands For swe-bench-multimodal, update the accuracy to use a special "solveable score"; here is an example:

https://github.com/OpenHands/openhands-index-results/blob/main/results/claude-4.5-opus/scores.json#L54-L75

1. Data Source: ambiguity_annotations.json

The ambiguity_annotations.json file from the OpenHands/benchmarks repository contains annotations for each instance in the SWE-bench Multimodal benchmark. Each instance is classified with keywords:

  • SOLVEABLE - Instance can be reasonably solved by an agent
  • HIDDEN_FUNCTIONAL_REQUIREMENT - Tests require knowledge not in the problem description
  • IDENTIFIER_NAME_UNDERSPECIFIED - Tests require specific names not in codebase
  • VISUAL_OUTPUT_MISMATCH - Tests require pixel-perfect PNG matching
  • UNRELATED_TESTS - Tests are unrelated to the problem
  • DATASET_BUG - Problem statement and tests don't match

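The thread does not show the file's exact layout, so the snippet below is only a hypothetical illustration of how one instance's annotation might look, assuming a simple mapping from instance id to keyword list; the real file in OpenHands/benchmarks may be structured differently.

```python
# Hypothetical shape only: the actual ambiguity_annotations.json in
# OpenHands/benchmarks may use different field names or nesting.
example_annotations = {
    "example-project__instance-0001": ["SOLVEABLE"],
    "example-project__instance-0002": ["VISUAL_OUTPUT_MISMATCH",
                                       "UNRELATED_TESTS"],
}
```
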
2. Calculation Process

For this model's evaluation results:

a. Get resolved instances from the evaluation results (from the full_archive tar.gz URL)

b. Cross-reference each resolved instance with ambiguity_annotations.json to determine if it's SOLVEABLE or not

c. Calculate the metrics:

  • solveable_total: Count of instances with "SOLVEABLE" keyword
  • unsolveable_total: Count of instances without "SOLVEABLE" keyword
  • solveable_resolved: Count of resolved instances that are SOLVEABLE
  • unsolveable_resolved: Count of resolved instances that are NOT SOLVEABLE
  • solveable_accuracy: (solveable_resolved / solveable_total) × 100
  • unsolveable_accuracy: (unsolveable_resolved / unsolveable_total) × 100
  • combined_accuracy: (total_resolved / total_instances) × 100

3. Example: claude-4.6-opus

  • solveable_resolved: 28, solveable_total: 67 → 28/67 = 41.79% ≈ 41.8%
  • unsolveable_resolved: 1, unsolveable_total: 33 → 1/33 = 3.03% ≈ 3.0%
  • combined_accuracy: (28 + 1)/(67 + 33) = 29/100 = 29.0%

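As a rough illustration, here is a minimal Python sketch of steps a-c, assuming ambiguity_annotations.json holds a JSON mapping from instance id to keyword list (as illustrated above) and that the resolved instance ids have already been extracted from the evaluation archive; neither format is confirmed by this thread.

```python
import json


def solveable_scores(annotations_path: str, resolved_ids: set[str]) -> dict:
    """Split benchmark accuracy into solveable / unsolveable components.

    Assumes annotations_path holds a JSON mapping of instance_id to a
    list of keyword strings, e.g. ["SOLVEABLE"] (illustrative only).
    """
    with open(annotations_path) as f:
        annotations = json.load(f)

    # Partition all benchmark instances by the SOLVEABLE annotation.
    solveable = {i for i, kws in annotations.items() if "SOLVEABLE" in kws}
    unsolveable = set(annotations) - solveable

    # Cross-reference the resolved instances with each partition.
    solveable_resolved = len(solveable & resolved_ids)
    unsolveable_resolved = len(unsolveable & resolved_ids)

    return {
        "solveable_resolved": solveable_resolved,
        "solveable_total": len(solveable),
        "unsolveable_resolved": unsolveable_resolved,
        "unsolveable_total": len(unsolveable),
        "solveable_accuracy": round(100 * solveable_resolved / len(solveable), 1),
        "unsolveable_accuracy": round(100 * unsolveable_resolved / len(unsolveable), 1),
        "combined_accuracy": round(100 * len(resolved_ids) / len(annotations), 1),
    }
```

Plugging in the glm-4.7 numbers reported below (15 resolved, 68 solveable, 34 unsolveable, 102 total) yields 22.1%, 0.0%, and 14.7% respectively.
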
4. Why this matters

The swe-bench-multimodal benchmark now uses solveable_accuracy as the primary metric because it is more meaningful: it measures how well the model performs on instances that an agent can actually solve, filtering out those with ambiguous requirements, visual output mismatches, or other issues that make them unfair to evaluate against.

The original score, combined_accuracy, included unsolveable instances and therefore penalized models for problems that could not reasonably be solved.

@openhands-ai

openhands-ai bot commented Feb 9, 2026

I'm on it! juanmichelini can track my progress at all-hands.dev

- Changed primary score from combined_accuracy (14.7%) to solveable_accuracy (22.1%)
- Added component_scores with detailed breakdown:
  - solveable_accuracy: 22.1% (15/68 solveable instances resolved)
  - unsolveable_accuracy: 0.0% (0/34 unsolveable instances resolved)
  - combined_accuracy: 14.7% (15/102 total instances resolved)

The solveable_accuracy metric is more meaningful as it measures performance
on instances that are actually solveable by an agent, filtering out instances
with ambiguous requirements, visual output mismatches, or other issues.

Co-authored-by: openhands <[email protected]>
@openhands-ai

openhands-ai bot commented Feb 9, 2026

I've successfully updated the swe-bench-multimodal results for glm-4.7 to use the solveable_accuracy metric. Here's a summary of the changes:

Changes Made

File Modified: results/glm-4.7/scores.json

Updates:

  1. ✅ Changed primary score from 14.7% (combined_accuracy) to 22.1% (solveable_accuracy)
  2. ✅ Changed metric from "accuracy" to "solveable_accuracy"
  3. ✅ Added component_scores object with detailed breakdown:
    • solveable_accuracy: 22.1% (15/68 solveable instances resolved)
    • unsolveable_accuracy: 0.0% (0/34 unsolveable instances resolved)
    • combined_accuracy: 14.7% (15/102 total instances resolved)
    • solveable_resolved: 15
    • solveable_total: 68
    • unsolveable_resolved: 0
    • unsolveable_total: 34
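
For context, a minimal sketch of how an update like this could be written: the component_scores field names and values come from the list above, but the surrounding layout of results/glm-4.7/scores.json is assumed here, not copied from the repository.

```python
import json
from pathlib import Path

# Assumed layout: a top-level object keyed by benchmark name. The real
# scores.json (see the claude-4.5-opus example linked above) may nest
# these fields differently.
scores_path = Path("results/glm-4.7/scores.json")
scores = json.loads(scores_path.read_text())

scores["swe-bench-multimodal"] = {
    "score": 22.1,                    # primary score: solveable_accuracy
    "metric": "solveable_accuracy",
    "component_scores": {
        "solveable_accuracy": 22.1,   # 15 / 68
        "unsolveable_accuracy": 0.0,  # 0 / 34
        "combined_accuracy": 14.7,    # 15 / 102
        "solveable_resolved": 15,
        "solveable_total": 68,
        "unsolveable_resolved": 0,
        "unsolveable_total": 34,
    },
}

scores_path.write_text(json.dumps(scores, indent=2) + "\n")
```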

Verification:

  • The metrics were extracted from the evaluation results archive (report.json)
  • Cross-referenced with ambiguity_annotations.json from OpenHands/benchmarks repository
  • The format matches the example provided (claude-4.5-opus scores.json)

The changes have been committed and pushed to the branch eval/glm-4.7/swe-bench-multimodal-20260209-174705, which will update PR #531.

@juanmichelini juanmichelini merged commit 109e84d into main Feb 10, 2026
1 check passed
@juanmichelini juanmichelini deleted the eval/glm-4.7/swe-bench-multimodal-20260209-174705 branch February 10, 2026 09:55