Conversation

@all-hands-bot (Collaborator)

Evaluation Results

Model: glm-4.7
Benchmark: swe-bench
Agent Version: v1.10.0

Results

  • Accuracy: 73.4%
  • Total Cost: $255.45
  • Average Instance Cost: $0.51
  • Total Duration: 503521s (8392.0m)
  • Average Instance Runtime: 1007s

Report Summary

  • Total instances: 500
  • Submitted instances: 498
  • Resolved instances: 367
  • Unresolved instances: 129
  • Empty patch instances: 0
  • Error instances: 2

Additional Metadata

  • completed_instances: 496
  • schema_version: 2
  • unstopped_instances: 0
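
As a sanity check, the headline numbers above follow directly from the raw counts. A minimal sketch (variable names are illustrative, and it assumes the averages are taken over all 500 instances, which matches the reported values):

```python
# Sanity-check the reported aggregates from the raw counts above.
# Variable names are illustrative; the pipeline's actual schema may differ.
total_instances = 500
resolved_instances = 367
total_cost_usd = 255.45
total_duration_s = 503_521

accuracy = resolved_instances / total_instances            # 0.734 -> 73.4%
avg_instance_cost = total_cost_usd / total_instances       # ~$0.51
avg_instance_runtime = total_duration_s / total_instances  # ~1007 s

print(f"Accuracy: {accuracy:.1%}")                  # Accuracy: 73.4%
print(f"Avg cost: ${avg_instance_cost:.2f}")        # Avg cost: $0.51
print(f"Avg runtime: {avg_instance_runtime:.0f}s")  # Avg runtime: 1007s
```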

This PR was automatically created by the evaluation pipeline.

@github-actions bot commented Feb 9, 2026

📊 Progress Report

============================================================
OpenHands Index Results - Progress Report
============================================================

Target: Complete all model × benchmark pairs
  11 models × 5 benchmarks = 55 pairs
  (each pair requires all 3 metrics: score, cost_per_instance, average_runtime)

============================================================
OVERALL PROGRESS: ⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛ 100.0%
  Complete: 55 / 55 pairs
============================================================
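
The 100.0% figure is simple pair-completeness over the models × benchmarks grid, where a pair only counts once all three metrics are present. A hypothetical sketch of such a check; the in-memory results layout is assumed for illustration and is not the pipeline's actual data structure:

```python
# Hypothetical completeness check over the models x benchmarks x metrics grid.
# The results dict keyed by (model, benchmark) is assumed for illustration;
# the real data lives in JSON files under results/ with its own schema.
REQUIRED_METRICS = ("score", "cost_per_instance", "average_runtime")

def count_complete_pairs(results, models, benchmarks):
    """A (model, benchmark) pair is complete when all three metrics are present."""
    complete = 0
    for model in models:
        for bench in benchmarks:
            entry = results.get((model, bench), {})
            if all(entry.get(m) is not None for m in REQUIRED_METRICS):
                complete += 1
    return complete

# e.g. 11 models x 5 benchmarks = 55 pairs:
# done = count_complete_pairs(results, models, benchmarks)
# print(f"Complete: {done} / {len(models) * len(benchmarks)} pairs")
```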

✅ Schema Validation

============================================================
Schema Validation Report
============================================================

Results directory: /home/runner/work/openhands-index-results/openhands-index-results/results
Files validated: 28
  Passed: 28
  Failed: 0

============================================================
VALIDATION PASSED
============================================================
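
The validator itself isn't shown in this log. A minimal sketch of what per-file validation could look like, assuming one JSON file per result and the jsonschema package; the schema fields below are invented for illustration (only schema_version: 2 comes from the metadata above):

```python
# Minimal schema-validation sketch. The schema is invented for illustration;
# the real schema lives in the repo (note schema_version: 2 above).
import json
from pathlib import Path

from jsonschema import Draft7Validator

SCHEMA = {
    "type": "object",
    "required": ["schema_version", "model", "benchmark"],
    "properties": {
        "schema_version": {"const": 2},
        "model": {"type": "string"},
        "benchmark": {"type": "string"},
    },
}

def validate_results_dir(results_dir: str) -> tuple[int, int]:
    """Validate every *.json under results_dir; return (passed, failed) counts."""
    validator = Draft7Validator(SCHEMA)
    passed = failed = 0
    for path in sorted(Path(results_dir).rglob("*.json")):
        errors = list(validator.iter_errors(json.loads(path.read_text())))
        if errors:
            failed += 1
            print(f"FAILED {path}: {errors[0].message}")
        else:
            passed += 1
    return passed, failed

# passed, failed = validate_results_dir("results")  # e.g. 28 passed, 0 failed
```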

This report measures progress towards the 3D array goal (benchmarks × models × metrics) as described in #2.

@juanmichelini (Collaborator) left a comment

LGTM

@juanmichelini merged commit 949f99f into main Feb 9, 2026
1 check passed
@juanmichelini deleted the eval/glm-4.7/swe-bench-20260209-174604 branch February 9, 2026 19:49