@all-hands-bot

Evaluation Results

Model: glm-4.7
Benchmark: swt-bench
Agent Version: v1.10.0

Results

  • Accuracy: 49.4%
  • Total Cost: $156.32
  • Average Instance Cost: $0.37
  • Total Duration: 314009s (5233.5m)
  • Average Instance Runtime: 744s

⚠️ REVIEWER NOTE

total_instances (422) does not match expected (433).
Accuracy is calculated using expected_instances (433) as the denominator.
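The effect of the denominator choice can be checked with a little arithmetic. The sketch below uses the resolved count from the Report Summary (214); the variable names are illustrative and are not taken from the evaluation pipeline.

```python
# Figures copied from this report; names are illustrative only.
resolved_instances = 214
total_instances = 422      # instances actually present in the run
expected_instances = 433   # instances the benchmark is expected to contain

# Accuracy as reported: missing instances count as failures.
accuracy_expected = resolved_instances / expected_instances

# For comparison only: accuracy over the instances that were actually run.
accuracy_submitted = resolved_instances / total_instances

print(f"{accuracy_expected:.1%}")   # 49.4%
print(f"{accuracy_submitted:.1%}")  # 50.7%
```

Using the expected count gives the more conservative figure, since the 11 missing instances are treated as unresolved.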


Report Summary

  • Total instances: 422
  • Submitted instances: 422
  • Resolved instances: 214
  • Unresolved instances: 206
  • Empty patch instances: 0
  • Error instances: 2

Additional Metadata

  • Mean coverage: 0.7815494682208436
  • Mean coverage delta: 0.6100117958136516
  • completed_instances: 420
  • unstopped_instances: 0
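As a quick sanity check, the counts and averages above appear to be mutually consistent. The sketch below assumes that completed_instances equals resolved plus unresolved and that the per-instance averages are taken over the 422 submitted instances; both relationships are inferred from the numbers, not stated by the pipeline.

```python
# Figures copied from this report; names are illustrative only.
resolved, unresolved, errors = 214, 206, 2
total_instances, completed_instances = 422, 420
total_cost_usd, total_duration_s = 156.32, 314009

# Every submitted instance falls into exactly one outcome bucket.
assert resolved + unresolved + errors == total_instances
# Assumption: "completed" means the instance finished without an error.
assert resolved + unresolved == completed_instances

# Assumption: averages are taken over submitted instances (422), not expected (433).
avg_cost = total_cost_usd / total_instances
avg_runtime = total_duration_s / total_instances
duration_min = total_duration_s / 60

print(f"${avg_cost:.2f}")       # $0.37
print(f"{avg_runtime:.0f}s")    # 744s
print(f"{duration_min:.1f}m")   # 5233.5m
```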

This PR was automatically created by the evaluation pipeline.

@github-actions

github-actions bot commented Feb 9, 2026

📊 Progress Report

============================================================
OpenHands Index Results - Progress Report
============================================================

Target: Complete all model × benchmark pairs
  11 models × 5 benchmarks = 55 pairs
  (each pair requires all 3 metrics: score, cost_per_instance, average_runtime)

============================================================
OVERALL PROGRESS: ⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛ 100.0%
  Complete: 55 / 55 pairs
============================================================

✅ Schema Validation

============================================================
Schema Validation Report
============================================================

Results directory: /home/runner/work/openhands-index-results/openhands-index-results/results
Files validated: 28
  Passed: 28
  Failed: 0

============================================================
VALIDATION PASSED
============================================================

This report measures progress towards the 3D array goal (benchmarks × models × metrics) as described in #2.
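The completeness rule in the progress report (a pair counts only when all three metrics are present) can be sketched as follows. This is a minimal illustration, not the repository's actual script: the `progress` function, the dictionary layout, and the `other-model` entry are hypothetical, and only the three metric names come from the report.

```python
# Metric names as listed in the progress report.
REQUIRED_METRICS = ("score", "cost_per_instance", "average_runtime")

def progress(results, models, benchmarks):
    """Count (model, benchmark) pairs that have all required metrics.

    `results` is a hypothetical mapping of (model, benchmark) -> metric dict.
    Returns (complete_pairs, total_pairs, percent_complete).
    """
    complete = 0
    for model in models:
        for benchmark in benchmarks:
            metrics = results.get((model, benchmark), {})
            if all(metrics.get(name) is not None for name in REQUIRED_METRICS):
                complete += 1
    total = len(models) * len(benchmarks)
    percent = 100.0 * complete / total if total else 0.0
    return complete, total, percent

# Example: one complete pair (values from this PR) and one hypothetical incomplete pair.
example = {
    ("glm-4.7", "swt-bench"): {
        "score": 49.4, "cost_per_instance": 0.37, "average_runtime": 744,
    },
    ("other-model", "swt-bench"): {"score": 10.0, "cost_per_instance": 0.5},
}
print(progress(example, ["glm-4.7", "other-model"], ["swt-bench"]))  # (1, 2, 50.0)
```

With 11 models and 5 benchmarks the same function would report a total of 55 pairs, matching the target above.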

@juanmichelini merged commit ad75f38 into main on Feb 9, 2026
1 check passed
@juanmichelini deleted the eval/glm-4.7/swt-bench-20260209-175040 branch on February 9, 2026 19:48