@all-hands-bot
Collaborator

Evaluation Results

Model: qwen-3-coder
Benchmark: gaia
Agent Version: v1.11.0

Results

  • Accuracy: 24.8%
  • Total Cost: $0.00
  • Average Instance Cost: $0.00
  • Total Duration: 113930s (1898.8m)
  • Average Instance Runtime: 690s

Report Summary

  • Total instances: 165
  • Submitted instances: 165
  • Resolved instances: 41
  • Unresolved instances: 124
  • Empty patch instances: 0
  • Error instances: 0

Additional Metadata

  • completed_instances: 165
  • incomplete_instances: 0
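
The headline figures above can be cross-checked from the raw counts. This is a minimal sketch, assuming accuracy is resolved instances divided by total instances and average runtime is total duration divided by total instances; the pipeline's actual formulas are not shown in this PR.

```python
# Cross-check of the reported figures. The formulas are assumptions,
# not the evaluation pipeline's actual code.
total_instances = 165
resolved_instances = 41
total_duration_s = 113930

accuracy_pct = round(100 * resolved_instances / total_instances, 1)
duration_min = round(total_duration_s / 60, 1)
avg_runtime_s = round(total_duration_s / total_instances)

print(f"Accuracy: {accuracy_pct}%")
print(f"Total Duration: {total_duration_s}s ({duration_min}m)")
print(f"Average Instance Runtime: {avg_runtime_s}s")
```

Under those assumptions the counts agree with the reported accuracy, duration, and average runtime.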

This PR was automatically created by the evaluation pipeline.

@github-actions
github-actions bot commented Feb 9, 2026

📊 Progress Report

============================================================
OpenHands Index Results - Progress Report
============================================================

Target: Complete all model × benchmark pairs
  11 models × 5 benchmarks = 55 pairs
  (each pair requires all 3 metrics: score, cost_per_instance, average_runtime)

============================================================
OVERALL PROGRESS: ⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛ 100.0%
  Complete: 55 / 55 pairs
============================================================
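
The completeness rule stated above (a pair counts only when all three metrics are present) could be sketched as follows. The `results` layout and the function names are hypothetical, and this sketch checks presence only, so a `cost_per_instance` of 0.0 still counts as present; the real progress script may behave differently.

```python
# Hypothetical sketch of the "model x benchmark pair" completeness check.
# The data layout (a dict keyed by (model, benchmark)) is an assumption.
REQUIRED_METRICS = ("score", "cost_per_instance", "average_runtime")


def is_complete(entry):
    """A pair is complete when every required metric is present (not None)."""
    return all(entry.get(metric) is not None for metric in REQUIRED_METRICS)


def progress(results, models, benchmarks):
    """Return (complete_pairs, total_pairs, percent_complete)."""
    total = len(models) * len(benchmarks)
    complete = sum(
        1
        for model in models
        for benchmark in benchmarks
        if is_complete(results.get((model, benchmark), {}))
    )
    return complete, total, 100.0 * complete / total


# Usage with made-up data: one complete pair and one missing a metric.
example = {
    ("model-a", "gaia"): {"score": 24.8, "cost_per_instance": 0.0, "average_runtime": 690},
    ("model-b", "gaia"): {"score": 30.0, "average_runtime": 500},
}
print(progress(example, ["model-a", "model-b"], ["gaia"]))
```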

❌ Schema Validation

============================================================
Schema Validation Report
============================================================

Results directory: /home/runner/work/openhands-index-results/openhands-index-results/results
Files validated: 28
  Passed: 27
  Failed: 1

Errors:
  - /home/runner/work/openhands-index-results/openhands-index-results/results/qwen-3-coder/scores.json: Entry 0:
  • Field 'cost_per_instance': Input should be greater than 0 (got: 0.0)

============================================================
VALIDATION FAILED
============================================================
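
The failing rule can be illustrated with a small stand-alone check. This is a sketch only: `validate_entry` is a hypothetical helper written in plain Python, not the repository's validator, and it covers just the one constraint reported above (`cost_per_instance` must be strictly greater than 0).

```python
# Hypothetical stand-in for the schema check that failed; the real validator
# is not shown in this PR and may enforce additional fields and rules.
from typing import List


def validate_entry(index: int, entry: dict) -> List[str]:
    """Return a list of error strings for one scores.json entry."""
    errors = []
    cost = entry.get("cost_per_instance")
    if cost is None:
        errors.append(f"Entry {index}: Field 'cost_per_instance': Field required")
    elif cost <= 0:
        errors.append(
            f"Entry {index}: Field 'cost_per_instance': "
            f"Input should be greater than 0 (got: {cost})"
        )
    return errors


# An entry with a zero cost, as in the qwen-3-coder result, is rejected.
print(validate_entry(0, {"cost_per_instance": 0.0}))
# Any strictly positive cost passes this particular check.
print(validate_entry(0, {"cost_per_instance": 0.01}))
```

Because the reported Total Cost and Average Instance Cost are both $0.00, the entry fails this check until a positive cost is recorded.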

This report measures progress towards the 3D array goal (benchmarks × models × metrics) as described in #2.

@openhands-ai
openhands-ai bot commented Feb 9, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Measure Progress

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #536 at branch `eval/qwen-3-coder/gaia-20260209-232952`

Feel free to include any additional details that might help me get this PR into a better state.
