Add swe-bench results for claude-4.6-opus #529

all-hands-bot · 2026-02-09T16:31:44Z

Evaluation Results

Model: claude-4.6-opus
Benchmark: swe-bench
Agent Version: v1.11.0

Results

Accuracy: 74.8%
Total Cost: $277.81
Average Instance Cost: $0.56
Total Duration: 88868s (1481.1m)
Average Instance Runtime: 178s

Report Summary

Total instances: 500
Submitted instances: 500
Resolved instances: 374
Unresolved instances: 125
Empty patch instances: 1
Error instances: 0
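
The headline figures can be cross-checked against the raw counts. The following is a minimal sketch that uses only the numbers reported in this comment; the variable names are illustrative and are not taken from the evaluation pipeline.

```python
# Figures copied from the report above; variable names are illustrative.
total_instances = 500
resolved = 374
unresolved = 125
empty_patch = 1
errors = 0
total_cost_usd = 277.81
total_duration_s = 88868

# The outcome categories should partition the full instance set.
assert resolved + unresolved + empty_patch + errors == total_instances

# Recompute the derived metrics with the same rounding the report displays.
accuracy_pct = round(100 * resolved / total_instances, 1)
avg_cost_usd = round(total_cost_usd / total_instances, 2)
avg_runtime_s = round(total_duration_s / total_instances)
total_duration_min = round(total_duration_s / 60, 1)

print(accuracy_pct, avg_cost_usd, avg_runtime_s, total_duration_min)
```

Dividing by the 499 completed instances instead of the 500 total gives the same rounded cost and runtime, so the report alone does not reveal which denominator the pipeline uses.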

Additional Metadata

completed_instances: 499
schema_version: 2
unstopped_instances: 0

This PR was automatically created by the evaluation pipeline.

juanmichelini · 2026-02-09T17:10:01Z

@OpenHands the measure-progress task seems to be stuck. Can you check?

openhands-ai · 2026-02-09T17:10:11Z

I'm on it! juanmichelini can track my progress at all-hands.dev

Co-authored-by: openhands <[email protected]>

github-actions · 2026-02-09T17:12:55Z

📊 Progress Report

============================================================
OpenHands Index Results - Progress Report
============================================================

Target: Complete all model × benchmark pairs
  11 models × 5 benchmarks = 55 pairs
  (each pair requires all 3 metrics: score, cost_per_instance, average_runtime)

============================================================
OVERALL PROGRESS: ⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛ 100.0%
  Complete: 55 / 55 pairs
============================================================

✅ Schema Validation

============================================================
Schema Validation Report
============================================================

Results directory: /home/runner/work/openhands-index-results/openhands-index-results/results
Files validated: 28
  Passed: 28
  Failed: 0

============================================================
VALIDATION PASSED
============================================================

This report measures progress towards the 3D array goal (benchmarks × models × metrics) as described in #2.
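
The completion rule in the progress report (a pair counts only when all three metrics are present) can be sketched as follows. This is an illustration under assumptions, not the repository's actual script: the `progress` function and the `results` mapping keyed by `(model, benchmark)` are hypothetical, and only the three metric names come from the report.

```python
from itertools import product

# Metric names as listed in the progress report.
REQUIRED_METRICS = ("score", "cost_per_instance", "average_runtime")


def progress(models, benchmarks, results):
    """Return (complete_pairs, total_pairs, percent_complete).

    `results` maps (model, benchmark) to a dict of metric values.
    This layout is an assumption made for the sketch.
    """
    pairs = list(product(models, benchmarks))
    complete = sum(
        1
        for pair in pairs
        if all(results.get(pair, {}).get(m) is not None for m in REQUIRED_METRICS)
    )
    percent = 100.0 * complete / len(pairs) if pairs else 0.0
    return complete, len(pairs), percent


# Toy data: 2 models x 2 benchmarks, with one pair missing a metric.
models = ["model-a", "model-b"]
benchmarks = ["swe-bench", "other-bench"]
full = {"score": 1.0, "cost_per_instance": 1.0, "average_runtime": 1.0}
results = {
    ("model-a", "swe-bench"): dict(full),
    ("model-a", "other-bench"): dict(full),
    ("model-b", "swe-bench"): dict(full),
    ("model-b", "other-bench"): {"score": 1.0, "cost_per_instance": 1.0},
}

print(progress(models, benchmarks, results))
```

With 11 models and 5 benchmarks all fully populated, the same function would report 55 of 55 pairs, matching the figure above.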

openhands-ai · 2026-02-09T17:13:09Z

Looks like there are a few issues preventing this PR from being merged!

GitHub Actions are failing:
- Measure Progress

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #529 at branch `eval/v1.11.0_claude-4.6-opus/swe-bench-20260209-163141`

Feel free to include any additional details that might help me get this PR into a better state.

You can manage your notification settings

- Add swe-bench entry to results/claude-4.6-opus/scores.json
- Remove invalid v1.11.0_claude-4.6-opus directory that didn't match schema

The schema requires directory_name to match the model name, so results should be added to the existing claude-4.6-opus directory.

Co-authored-by: openhands <[email protected]>

openhands-ai · 2026-02-09T17:16:49Z

Summary

I investigated and fixed the stuck "measure-progress" task for PR #529.

Issue Found

The CI workflow was never triggered for this PR. After pushing an empty commit to trigger it, the workflow ran but failed due to schema validation errors:

- metadata.json was missing required fields (country, release_date)
- directory_name (v1.11.0_claude-4.6-opus) didn't match the expected model name (claude-4.6-opus)
- scores.json was missing the agent_version field
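
The three failures above correspond to three simple checks. The sketch below is an assumption-laden illustration, not the repository's validator: the function `validate_result_dir`, its parameters, and the idea that scores.json holds a list of entries are all hypothetical. Only the field names `country`, `release_date`, and `agent_version` come from the errors reported here.

```python
def validate_result_dir(directory_name, expected_model_name, metadata, scores):
    """Return a list of human-readable schema problems (empty if none).

    Hypothetical helper: `metadata` stands for the parsed metadata.json and
    `scores` for the parsed scores.json, assumed here to be a list of dicts.
    """
    problems = []

    # metadata.json must carry the required fields named in the CI errors.
    for field in ("country", "release_date"):
        if field not in metadata:
            problems.append(f"metadata.json is missing required field: {field}")

    # The directory name must equal the model name.
    if directory_name != expected_model_name:
        problems.append(
            f"directory_name ({directory_name}) does not match "
            f"model name ({expected_model_name})"
        )

    # Every scores.json entry must record the agent version.
    for index, entry in enumerate(scores):
        if "agent_version" not in entry:
            problems.append(f"scores.json entry {index} is missing agent_version")

    return problems


# The situation described above: wrong directory name, incomplete files.
for problem in validate_result_dir(
    "v1.11.0_claude-4.6-opus", "claude-4.6-opus", {}, [{}]
):
    print(problem)
```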

Fix Applied

Instead of creating a new directory with an invalid name, I:

- Added the swe-bench entry to the existing results/claude-4.6-opus/scores.json file
- Removed the invalid results/v1.11.0_claude-4.6-opus/ directory
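
Adding a benchmark entry to an existing scores list, rather than creating a second directory, might look like the sketch below. The entry layout is an assumption: the `benchmark` key and the helper `add_benchmark_entry` are hypothetical, while the metric names and the values are taken from the reports in this thread. The real 14-line entry is not shown in this conversation and may contain different fields.

```python
import json


def add_benchmark_entry(scores, entry):
    """Return a new list with `entry` appended.

    Hypothetical helper: refuses to add a second entry for the same
    benchmark so that reruns do not silently duplicate results.
    """
    if any(existing.get("benchmark") == entry["benchmark"] for existing in scores):
        raise ValueError(f"entry for {entry['benchmark']} already exists")
    return scores + [entry]


# Placeholder for whatever the existing scores.json already contains.
existing_scores = [{"benchmark": "example-bench", "agent_version": "v1.11.0"}]

# Values copied from the evaluation report; key names are assumed.
swe_bench_entry = {
    "benchmark": "swe-bench",
    "score": 74.8,
    "cost_per_instance": 0.56,
    "average_runtime": 178,
    "agent_version": "v1.11.0",
}

updated_scores = add_benchmark_entry(existing_scores, swe_bench_entry)
print(json.dumps(updated_scores, indent=2))
```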

Result

The final diff is minimal: it only adds 14 lines (the swe-bench entry) to the existing scores.json file. Checks after the fix:

✅ Schema Validation: PASSED (28/28 files)
✅ Progress Report: 100% complete (55/55 model×benchmark pairs)
✅ CI workflow: SUCCESS

The PR is now ready for review at: #529

Add swe-bench results for claude-4.6-opus

775333e

juanmichelini self-requested a review February 9, 2026 16:49

juanmichelini approved these changes Feb 9, 2026

Trigger CI workflow

2a5bee6

Co-authored-by: openhands <[email protected]>

juanmichelini merged commit 0f426a8 into main Feb 9, 2026
1 check passed

juanmichelini deleted the eval/v1.11.0_claude-4.6-opus/swe-bench-20260209-163141 branch February 9, 2026 17:30
