Conversation

@all-hands-bot (Collaborator)

Evaluation Results

Model: claude-4.6-opus
Benchmark: swe-bench
Agent Version: v1.11.0

Results

  • Accuracy: 74.8%
  • Total Cost: $277.81
  • Average Instance Cost: $0.56
  • Total Duration: 88868s (1481.1m)
  • Average Instance Runtime: 178s
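
The per-instance averages follow directly from the totals. A quick sanity check of the reported figures (assuming all 500 submitted instances are counted in the averages):

```python
# Sanity-check the derived metrics against the reported totals.
total_cost = 277.81        # "Total Cost: $277.81"
total_duration_s = 88868   # "Total Duration: 88868s"
resolved = 374
instances = 500

avg_cost = total_cost / instances
avg_runtime = total_duration_s / instances
accuracy = resolved / instances * 100

assert round(avg_cost, 2) == 0.56               # "Average Instance Cost: $0.56"
assert round(avg_runtime) == 178                # "Average Instance Runtime: 178s"
assert round(total_duration_s / 60, 1) == 1481.1  # "(1481.1m)"
assert round(accuracy, 1) == 74.8               # "Accuracy: 74.8%"
```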

Report Summary

  • Total instances: 500
  • Submitted instances: 500
  • Resolved instances: 374
  • Unresolved instances: 125
  • Empty patch instances: 1
  • Error instances: 0

Additional Metadata

  • completed_instances: 499
  • schema_version: 2
  • unstopped_instances: 0

This PR was automatically created by the evaluation pipeline.

@juanmichelini juanmichelini self-requested a review February 9, 2026 16:49
@juanmichelini (Collaborator)

@OpenHands the measure-progress task seems stuck, can you check?

@openhands-ai

openhands-ai bot commented Feb 9, 2026

I'm on it! juanmichelini can track my progress at all-hands.dev

Co-authored-by: openhands <[email protected]>
@github-actions

github-actions bot commented Feb 9, 2026

📊 Progress Report

============================================================
OpenHands Index Results - Progress Report
============================================================

Target: Complete all model × benchmark pairs
  11 models × 5 benchmarks = 55 pairs
  (each pair requires all 3 metrics: score, cost_per_instance, average_runtime)

============================================================
OVERALL PROGRESS: ⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛ 100.0%
  Complete: 55 / 55 pairs
============================================================

✅ Schema Validation

============================================================
Schema Validation Report
============================================================

Results directory: /home/runner/work/openhands-index-results/openhands-index-results/results
Files validated: 28
  Passed: 28
  Failed: 0

============================================================
VALIDATION PASSED
============================================================

This report measures progress towards the 3D array goal (benchmarks × models × metrics) as described in #2.
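
The 55-pair completeness check described in the report above could be sketched as follows. The data layout here is hypothetical (the real results schema lives in the repository); the point is that a (model, benchmark) pair only counts as complete when all three required metrics are present:

```python
# Hypothetical sketch of the progress check: count (model, benchmark)
# pairs that report all three required metrics.
REQUIRED_METRICS = {"score", "cost_per_instance", "average_runtime"}

def completed_pairs(results: dict) -> int:
    """Count pairs whose metric dict contains every required metric."""
    done = 0
    for model, benchmarks in results.items():
        for benchmark, metrics in benchmarks.items():
            if REQUIRED_METRICS <= metrics.keys():
                done += 1
    return done

# Example: one complete pair, one incomplete (missing average_runtime).
results = {
    "claude-4.6-opus": {
        "swe-bench": {"score": 74.8, "cost_per_instance": 0.56, "average_runtime": 178},
        "other-bench": {"score": 50.0, "cost_per_instance": 1.00},
    },
}
assert completed_pairs(results) == 1
```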

@openhands-ai

openhands-ai bot commented Feb 9, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Measure Progress

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #529 at branch `eval/v1.11.0_claude-4.6-opus/swe-bench-20260209-163141`

Feel free to include any additional details that might help me get this PR into a better state.


- Add swe-bench entry to results/claude-4.6-opus/scores.json
- Remove invalid v1.11.0_claude-4.6-opus directory that didn't match schema

The schema requires directory_name to match the model name, so results
should be added to the existing claude-4.6-opus directory.

Co-authored-by: openhands <[email protected]>
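
The directory-name rule from the commit message above could be checked with something like this. The helper name is hypothetical; the real validator runs in the repository's CI:

```python
from pathlib import Path

def directory_matches_model(results_dir: Path, model_name: str) -> bool:
    """The schema requires the results directory to be named after the
    model, so a versioned directory name fails validation.
    (Hypothetical helper illustrating the rule, not the actual validator.)"""
    return results_dir.name == model_name

# The existing directory passes; the versioned one does not.
assert directory_matches_model(Path("results/claude-4.6-opus"), "claude-4.6-opus")
assert not directory_matches_model(
    Path("results/v1.11.0_claude-4.6-opus"), "claude-4.6-opus"
)
```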
@openhands-ai

openhands-ai bot commented Feb 9, 2026

Summary

I investigated and fixed the stuck "measure-progress" task for PR #529.

Issue Found

The CI workflow was never triggered for this PR. After pushing an empty commit to trigger it, the workflow ran but failed due to schema validation errors:

  1. metadata.json was missing required fields (country, release_date)
  2. directory_name (v1.11.0_claude-4.6-opus) didn't match the expected model name (claude-4.6-opus)
  3. scores.json was missing the agent_version field

Fix Applied

Instead of creating a new directory with an invalid name, I:

  • Added the swe-bench entry to the existing results/claude-4.6-opus/scores.json file
  • Removed the invalid results/v1.11.0_claude-4.6-opus/ directory

Result

The final diff is minimal: it just adds 14 lines (the swe-bench entry) to the existing scores.json file.

  • ✅ Schema Validation: PASSED (28/28 files)
  • ✅ Progress Report: 100% complete (55/55 model×benchmark pairs)
  • ✅ CI workflow: SUCCESS

The PR is now ready for review at: #529
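
For illustration, a swe-bench entry of this shape would satisfy the validation errors listed above. The field names are inferred from the report (the three required metrics plus the missing agent_version) and are therefore hypothetical; the actual entry in scores.json may differ:

```json
{
  "benchmark": "swe-bench",
  "agent_version": "v1.11.0",
  "score": 74.8,
  "cost_per_instance": 0.56,
  "average_runtime": 178
}
```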


@juanmichelini juanmichelini merged commit 0f426a8 into main Feb 9, 2026
1 check passed
@juanmichelini juanmichelini deleted the eval/v1.11.0_claude-4.6-opus/swe-bench-20260209-163141 branch February 9, 2026 17:30
3 participants