
Conversation

@all-hands-bot
Collaborator

Evaluation Results

Model: glm-4.7
Benchmark: swe-bench-multimodal
Agent Version: v1.10.0

Results

  • Accuracy: 14.7%
  • Total Cost: $67.11
  • Average Instance Cost: $0.66
  • Total Duration: 154927s (2582.1m)
  • Average Instance Runtime: 1519s

Report Summary

  • Total instances: 102
  • Submitted instances: 98
  • Resolved instances: 15
  • Unresolved instances: 79
  • Empty patch instances: 4
  • Error instances: 0

Additional Metadata

  • completed_instances: 94
  • schema_version: 2
  • unstopped_instances: 0

This PR was automatically created by the evaluation pipeline.

@github-actions

github-actions bot commented Feb 9, 2026

📊 Progress Report

============================================================
OpenHands Index Results - Progress Report
============================================================

Target: Complete all model × benchmark pairs
  11 models × 5 benchmarks = 55 pairs
  (each pair requires all 3 metrics: score, cost_per_instance, average_runtime)

============================================================
OVERALL PROGRESS: ⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛ 100.0%
  Complete: 55 / 55 pairs
============================================================

✅ Schema Validation

============================================================
Schema Validation Report
============================================================

Results directory: /home/runner/work/openhands-index-results/openhands-index-results/results
Files validated: 28
  Passed: 28
  Failed: 0

============================================================
VALIDATION PASSED
============================================================

This report measures progress towards the 3D array goal (benchmarks × models × metrics) as described in #2.

@juanmichelini
Collaborator

@OpenHands For swe-bench-multimodal, update the accuracy to use a special "solveable score"; here is an example:

https://github.com/OpenHands/openhands-index-results/blob/main/results/claude-4.5-opus/scores.json#L54-L75

1. Data Source: ambiguity_annotations.json

The ambiguity_annotations.json file from the OpenHands/benchmarks repository contains annotations for each instance in the SWE-bench Multimodal benchmark. Each instance is classified with keywords:

  • SOLVEABLE - Instance can be reasonably solved by an agent
  • HIDDEN_FUNCTIONAL_REQUIREMENT - Tests require knowledge not in the problem description
  • IDENTIFIER_NAME_UNDERSPECIFIED - Tests require specific names not in codebase
  • VISUAL_OUTPUT_MISMATCH - Tests require pixel-perfect PNG matching
  • UNRELATED_TESTS - Tests are unrelated to the problem
  • DATASET_BUG - Problem statement and tests don't match

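The thread does not show the file's exact layout, so the snippet below is only a hypothetical illustration of how one instance's annotation might look, assuming a simple mapping from instance id to keyword list; the real file in OpenHands/benchmarks may be structured differently.

```python
# Hypothetical shape only: the actual ambiguity_annotations.json in
# OpenHands/benchmarks may use different field names or nesting.
example_annotations = {
    "example-project__instance-0001": ["SOLVEABLE"],
    "example-project__instance-0002": ["VISUAL_OUTPUT_MISMATCH",
                                       "UNRELATED_TESTS"],
}
```
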
2. Calculation Process

For this model's evaluation results:

a. Get resolved instances from the evaluation results (from the full_archive tar.gz URL)

b. Cross-reference each resolved instance with ambiguity_annotations.json to determine if it's SOLVEABLE or not

c. Calculate the metrics:

  • solveable_total: Count of instances with "SOLVEABLE" keyword
  • unsolveable_total: Count of instances without "SOLVEABLE" keyword
  • solveable_resolved: Count of resolved instances that are SOLVEABLE
  • unsolveable_resolved: Count of resolved instances that are NOT SOLVEABLE
  • solveable_accuracy: (solveable_resolved / solveable_total) × 100
  • unsolveable_accuracy: (unsolveable_resolved / unsolveable_total) × 100
  • combined_accuracy: (total_resolved / total_instances) × 100

3. Example: claude-4.6-opus

  • solveable_resolved: 28, solveable_total: 67 → 28/67 = 41.79% ≈ 41.8%
  • unsolveable_resolved: 1, unsolveable_total: 33 → 1/33 = 3.03% ≈ 3.0%
  • combined_accuracy: (28 + 1)/(67 + 33) = 29/100 = 29.0%

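As a rough illustration, here is a minimal Python sketch of steps a-c, assuming ambiguity_annotations.json holds a JSON mapping from instance id to keyword list (as illustrated above) and that the resolved instance ids have already been extracted from the evaluation archive; neither format is confirmed by this thread.

```python
import json


def solveable_scores(annotations_path: str, resolved_ids: set[str]) -> dict:
    """Split benchmark accuracy into solveable / unsolveable components.

    Assumes annotations_path holds a JSON mapping of instance_id to a
    list of keyword strings, e.g. ["SOLVEABLE"] (illustrative only).
    """
    with open(annotations_path) as f:
        annotations = json.load(f)

    # Partition all benchmark instances by the SOLVEABLE annotation.
    solveable = {i for i, kws in annotations.items() if "SOLVEABLE" in kws}
    unsolveable = set(annotations) - solveable

    # Cross-reference the resolved instances with each partition.
    solveable_resolved = len(solveable & resolved_ids)
    unsolveable_resolved = len(unsolveable & resolved_ids)

    return {
        "solveable_resolved": solveable_resolved,
        "solveable_total": len(solveable),
        "unsolveable_resolved": unsolveable_resolved,
        "unsolveable_total": len(unsolveable),
        "solveable_accuracy": round(100 * solveable_resolved / len(solveable), 1),
        "unsolveable_accuracy": round(100 * unsolveable_resolved / len(unsolveable), 1),
        "combined_accuracy": round(100 * len(resolved_ids) / len(annotations), 1),
    }
```

Plugging in the glm-4.7 numbers reported below (15 resolved, 68 solveable, 34 unsolveable, 102 total) yields 22.1%, 0.0%, and 14.7% respectively.
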
4. Why this matters

The swe-bench-multimodal benchmark now uses solveable_accuracy as the primary metric because it is more meaningful: it measures how well the model performs on instances that an agent can actually solve, filtering out those with ambiguous requirements, visual output mismatches, or other issues that make them unfair to evaluate against.

The original score, combined_accuracy, included unsolveable instances and therefore penalized models for problems that could not reasonably be solved.

@openhands-ai

openhands-ai bot commented Feb 9, 2026

I'm on it! juanmichelini can track my progress at all-hands.dev

- Changed primary score from combined_accuracy (14.7%) to solveable_accuracy (22.1%)
- Added component_scores with detailed breakdown:
  - solveable_accuracy: 22.1% (15/68 solveable instances resolved)
  - unsolveable_accuracy: 0.0% (0/34 unsolveable instances resolved)
  - combined_accuracy: 14.7% (15/102 total instances resolved)

The solveable_accuracy metric is more meaningful as it measures performance
on instances that are actually solveable by an agent, filtering out instances
with ambiguous requirements, visual output mismatches, or other issues.

Co-authored-by: openhands <[email protected]>
@openhands-ai

openhands-ai bot commented Feb 9, 2026

I've successfully updated the swe-bench-multimodal results for glm-4.7 to use the solveable_accuracy metric. Here's a summary of the changes:

Changes Made

File Modified: results/glm-4.7/scores.json

Updates:

  1. ✅ Changed primary score from 14.7% (combined_accuracy) to 22.1% (solveable_accuracy)
  2. ✅ Changed metric from "accuracy" to "solveable_accuracy"
  3. ✅ Added component_scores object with detailed breakdown:
    • solveable_accuracy: 22.1% (15/68 solveable instances resolved)
    • unsolveable_accuracy: 0.0% (0/34 unsolveable instances resolved)
    • combined_accuracy: 14.7% (15/102 total instances resolved)
    • solveable_resolved: 15
    • solveable_total: 68
    • unsolveable_resolved: 0
    • unsolveable_total: 34
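
For context, a minimal sketch of how an update like this could be written: the component_scores field names and values come from the list above, but the surrounding layout of results/glm-4.7/scores.json is assumed here, not copied from the repository.

```python
import json
from pathlib import Path

# Assumed layout: a top-level object keyed by benchmark name. The real
# scores.json (see the claude-4.5-opus example linked above) may nest
# these fields differently.
scores_path = Path("results/glm-4.7/scores.json")
scores = json.loads(scores_path.read_text())

scores["swe-bench-multimodal"] = {
    "score": 22.1,                    # primary score: solveable_accuracy
    "metric": "solveable_accuracy",
    "component_scores": {
        "solveable_accuracy": 22.1,   # 15 / 68
        "unsolveable_accuracy": 0.0,  # 0 / 34
        "combined_accuracy": 14.7,    # 15 / 102
        "solveable_resolved": 15,
        "solveable_total": 68,
        "unsolveable_resolved": 0,
        "unsolveable_total": 34,
    },
}

scores_path.write_text(json.dumps(scores, indent=2) + "\n")
```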

Verification:

  • The metrics were extracted from the evaluation results archive (report.json)
  • Cross-referenced with ambiguity_annotations.json from OpenHands/benchmarks repository
  • The format matches the example provided (claude-4.5-opus scores.json)

The changes have been committed and pushed to the branch eval/glm-4.7/swe-bench-multimodal-20260209-174705, which will update PR #531.

@juanmichelini juanmichelini merged commit 109e84d into main Feb 10, 2026
1 check passed
@juanmichelini juanmichelini deleted the eval/glm-4.7/swe-bench-multimodal-20260209-174705 branch February 10, 2026 09:55