Add swe-bench results for claude-4.6-opus #529
@OpenHands the measure-progress task seems stuck. Can you check?
I'm on it! juanmichelini can track my progress at all-hands.dev
Co-authored-by: openhands <[email protected]>
📊 Progress Report
✅ Schema Validation
This report measures progress towards the 3D array goal (benchmarks × models × metrics) as described in #2.
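As a rough illustration of what measuring progress towards a benchmarks × models × metrics array could look like, here is a minimal sketch. The nested-dict layout `results[benchmark][model][metric]` and the metric names used in the usage note are assumptions made for illustration; they are not taken from this repository's schema or from its measure-progress workflow.

```python
from itertools import product


def coverage(results, benchmarks, models, metrics):
    """Count how many cells of the benchmarks x models x metrics grid are filled.

    `results` is a hypothetical nested dict:
    results[benchmark][model][metric] -> value (or missing / None if not reported).
    Returns (filled, total, fraction).
    """
    total = len(benchmarks) * len(models) * len(metrics)
    filled = sum(
        1
        for b, m, k in product(benchmarks, models, metrics)
        if results.get(b, {}).get(m, {}).get(k) is not None
    )
    fraction = filled / total if total else 0.0
    return filled, total, fraction
```

With one placeholder cell filled out of a 2 × 1 × 2 grid, `coverage(...)` would report 1 of 4 cells, i.e. a fraction of 0.25. The placeholder value carries no meaning; only its presence is counted.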
Looks like there are a few issues preventing this PR from being merged!
If you'd like me to help, just leave a comment. Feel free to include any additional details that might help me get this PR into a better state.
- Add swe-bench entry to results/claude-4.6-opus/scores.json
- Remove invalid v1.11.0_claude-4.6-opus directory that didn't match schema

The schema requires directory_name to match the model name, so results should be added to the existing claude-4.6-opus directory.

Co-authored-by: openhands <[email protected]>
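The naming rule described in this commit can be sketched as a simple check. This is only an illustration under assumptions: the helper name `invalid_directories` and the idea of passing in a set of known model names are hypothetical, and the repository's actual schema validation may work differently.

```python
def invalid_directories(directory_names, known_models):
    """Return the directory names that do not exactly match a known model name.

    Hypothetical helper: mirrors the rule that a results directory must be
    named after its model, so version-prefixed names are rejected.
    """
    return [name for name in directory_names if name not in known_models]


if __name__ == "__main__":
    dirs = ["claude-4.6-opus", "v1.11.0_claude-4.6-opus"]
    # Only the version-prefixed directory is flagged.
    print(invalid_directories(dirs, {"claude-4.6-opus"}))
```

Under this rule, the agent version belongs inside the entry in scores.json rather than in the directory name, which is why the fix adds to the existing claude-4.6-opus directory.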
Summary
I investigated and fixed the stuck "measure-progress" task for PR #529.

Issue Found
The CI workflow was never triggered for this PR. After pushing an empty commit to trigger it, the workflow ran but failed due to schema validation errors:

Fix Applied
Instead of creating a new directory with an invalid name, I:

Result
The final diff is minimal, just adding 14 lines (the swe-bench entry) to the existing scores.json file:

The PR is now ready for review at: #529
Evaluation Results
Model: claude-4.6-opus
Benchmark: swe-bench
Agent Version: v1.11.0
Results
Report Summary
Additional Metadata
49920
This PR was automatically created by the evaluation pipeline.