@all-hands-bot
Collaborator

Evaluation Results

Model: qwen3-coder-next
Benchmark: swe-bench
Agent Version: v1.11.1

Results

  • Accuracy: 66.6%
  • Total Cost: $680.26
  • Average Instance Cost: $1.36
  • Total Duration: 722734s (12045.6m)
  • Average Instance Runtime: 1445s

Report Summary

  • Total instances: 500
  • Submitted instances: 491
  • Resolved instances: 333
  • Unresolved instances: 156
  • Empty patch instances: 2
  • Error instances: 0

Additional Metadata

  • completed_instances: 489
  • schema_version: 2
  • unstopped_instances: 0
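The figures in this report can be cross-checked with a few lines of arithmetic. The sketch below assumes that accuracy, average cost, and average runtime are all computed over the 500 total instances; the report does not state the denominators, but that reading reproduces the published values.

```python
# Figures copied from the evaluation report above.
total_instances = 500
submitted = 491
resolved = 333
unresolved = 156
empty_patch = 2
total_cost_usd = 680.26
total_duration_s = 722734

# Resolved plus unresolved should equal completed_instances (489),
# and adding empty patches should give the submitted count (491).
completed = resolved + unresolved
assert completed == 489
assert completed + empty_patch == submitted

# Assumption: every average uses the 500 total instances as denominator.
accuracy_pct = round(100 * resolved / total_instances, 1)
avg_cost_usd = round(total_cost_usd / total_instances, 2)
avg_runtime_s = round(total_duration_s / total_instances)
total_duration_min = round(total_duration_s / 60, 1)
not_submitted = total_instances - submitted

print(accuracy_pct, avg_cost_usd, avg_runtime_s, total_duration_min, not_submitted)
```

Under that assumption, the 9 instances that were never submitted count against the accuracy figure in the same way as unresolved ones.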

This PR was automatically created by the evaluation pipeline.

@github-actions

github-actions bot commented Feb 9, 2026

📊 Progress Report

============================================================
OpenHands Index Results - Progress Report
============================================================

Target: Complete all model × benchmark pairs
  11 models × 5 benchmarks = 55 pairs
  (each pair requires all 3 metrics: score, cost_per_instance, average_runtime)

============================================================
OVERALL PROGRESS: ⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛⬛ 100.0%
  Complete: 55 / 55 pairs
============================================================
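As a rough illustration of how the progress figure could be derived, the sketch below counts a model × benchmark pair as complete only when all three metrics are present. The `results` layout, the `progress` helper, and the placeholder model and benchmark names are hypothetical; the real pipeline's data structures are not shown in this thread.

```python
REQUIRED_METRICS = ("score", "cost_per_instance", "average_runtime")


def progress(results, models, benchmarks):
    """Return (complete_pairs, total_pairs, percent_complete).

    `results` is a hypothetical mapping from (model, benchmark) to a dict
    of metric values; a pair counts only if every required metric is set.
    """
    total = len(models) * len(benchmarks)
    complete = 0
    for model in models:
        for benchmark in benchmarks:
            metrics = results.get((model, benchmark), {})
            if all(metrics.get(name) is not None for name in REQUIRED_METRICS):
                complete += 1
    percent = 100.0 * complete / total if total else 0.0
    return complete, total, percent


# Placeholder names: 11 models and 5 benchmarks, as in the report.
models = [f"model-{i}" for i in range(11)]
benchmarks = [f"bench-{j}" for j in range(5)]
full = {
    (m, b): {"score": 50.0, "cost_per_instance": 1.0, "average_runtime": 100.0}
    for m in models
    for b in benchmarks
}
print(progress(full, models, benchmarks))

# Dropping a single metric makes that pair incomplete.
partial = dict(full)
partial[(models[0], benchmarks[0])] = {"score": 50.0, "cost_per_instance": 1.0}
print(progress(partial, models, benchmarks))
```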

❌ Schema Validation

============================================================
Schema Validation Report
============================================================

Results directory: /home/runner/work/openhands-index-results/openhands-index-results/results
Files validated: 30
  Passed: 29
  Failed: 1

Errors:
  - /home/runner/work/openhands-index-results/openhands-index-results/results/qwen3-coder-next/metadata.json:   • Field 'model': Input should be 'claude-4.6-opus', 'claude-4.5-opus', 'claude-4.5-sonnet', 'gemini-3-pro', 'gemini-3-flash', 'glm-4.7', 'gpt-5.2', 'gpt-5.2-codex', 'kimi-k2-thinking', 'kimi-k2.5', 'minimax-m2.1', 'deepseek-v3.2-reasoner', 'qwen-3-coder' or 'nemotron-3-nano' (got: 'qwen3-coder-next')

============================================================
VALIDATION FAILED
============================================================
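The failure above comes down to the `model` value not being in a fixed list of allowed names. The sketch below reproduces that check in plain Python. The allowed names are copied from the error message; the `validate_model` helper and its message wording are hypothetical, and the repository's actual validator is not shown in this thread.

```python
# Allowed names copied from the validation error above.
ALLOWED_MODELS = (
    "claude-4.6-opus", "claude-4.5-opus", "claude-4.5-sonnet",
    "gemini-3-pro", "gemini-3-flash", "glm-4.7", "gpt-5.2",
    "gpt-5.2-codex", "kimi-k2-thinking", "kimi-k2.5", "minimax-m2.1",
    "deepseek-v3.2-reasoner", "qwen-3-coder", "nemotron-3-nano",
)


def validate_model(metadata):
    """Return a list of error strings; an empty list means the field passes."""
    model = metadata.get("model")
    if model not in ALLOWED_MODELS:
        return [f"Field 'model': got {model!r}, which is not an allowed value"]
    return []


print(validate_model({"model": "qwen3-coder-next"}))  # one error
print(validate_model({"model": "qwen-3-coder"}))      # no errors
```

Either registering `qwen3-coder-next` in the allowed list or renaming the entry to an existing allowed value would clear the error; the thread does not say which approach the repository takes.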

This report measures progress towards the 3D array goal (benchmarks × models × metrics) as described in #2.

@juanmichelini
Collaborator

@OpenHands search the web and add fields for "country", "release_date", "parameter_count_b", and "active_parameter_count_b". Show your work: add a comment in the PR with links proving each of the fields.

@openhands-ai

openhands-ai bot commented Feb 9, 2026

I'm on it! juanmichelini can track my progress at all-hands.dev

…b to qwen3-coder-next metadata

- country: cn (Alibaba is a Chinese company)
- release_date: 2026-02-02 (from official Qwen blog)
- parameter_count_b: 80 (80B total parameters)
- active_parameter_count_b: 3 (3B active parameters per token)

Co-authored-by: openhands <[email protected]>
Collaborator

Sources for Added Metadata Fields

I have added the following fields to qwen3-coder-next/metadata.json:

| Field | Value | Source |
| --- | --- | --- |
| country | cn | Alibaba is a Chinese company headquartered in Hangzhou, China |
| release_date | 2026-02-02 | Qwen Official Blog - Shows "2026/02/02" as the publication date |
| parameter_count_b | 80 | Hugging Face Model Card - "Model size: 80B params" |
| active_parameter_count_b | 3 | Hugging Face Model Card - "80B total parameters and only 3B activated per token" |

Additional Supporting Sources:

  • NewsBreak Article - "Released February 3 under an Apache 2.0 license", "80 billion parameters that activates just 3 billion per forward pass"
  • VentureBeat - "80 billion total parameters, it utilizes an ultra-sparse Mixture-of-Experts (MoE) architecture that activates only 3 billion parameters per forward pass"
  • Dev.to Guide - "Total Parameters: 80B, Activated Parameters: 3B (per inference)"

@openhands-ai

openhands-ai bot commented Feb 9, 2026

Looks like there are a few issues preventing this PR from being merged!

  • GitHub Actions are failing:
    • Measure Progress

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #539 at branch `eval/qwen3-coder-next/swe-bench-20260209-234004`

Feel free to include any additional details that might help me get this PR into a better state.


@openhands-ai

openhands-ai bot commented Feb 9, 2026

Summary

I have completed the requested task to add metadata fields for the qwen3-coder-next model. Here's what was done:

Changes Made

Updated results/qwen3-coder-next/metadata.json with the following fields:

| Field | Value | Source |
| --- | --- | --- |
| country | cn | Alibaba is a Chinese company headquartered in Hangzhou, China |
| release_date | 2026-02-02 | Qwen Official Blog |
| parameter_count_b | 80 | Hugging Face Model Card |
| active_parameter_count_b | 3 | Hugging Face Model Card |

Checklist

  • Added country field
  • Added release_date field
  • Added parameter_count_b field
  • Added active_parameter_count_b field
  • Committed and pushed changes to the PR branch
  • Added a comment on PR #539 (Add swe-bench results for qwen3-coder-next) with links proving each field's source

PR Comment

A detailed comment with source links has been added to PR #539.

The comment includes:

  • Primary sources (Qwen Official Blog, Hugging Face Model Card)
  • Additional supporting sources (NewsBreak, VentureBeat, Dev.to)


Collaborator

@juanmichelini juanmichelini left a comment


LGTM

@juanmichelini
Collaborator

@OpenHands fix the validation errors!

@openhands-ai

openhands-ai bot commented Feb 10, 2026

I'm on it! juanmichelini can track my progress at all-hands.dev

@juanmichelini
Collaborator

@OpenHands fix the validation errors!
