How is open-swe evaluated, and what is the current state of evaluation? #790
Unanswered
arthurthlee asked this question in Q&A
Replies: 0 comments
What evaluation metrics are used to evaluate open-swe? I can't find open-swe's results on any leaderboard, such as SWE-bench, and I can't find much documentation about evaluation in the repo.
I saw that a recent PR states:
In the code, I can see that it's evaluated on a ruffScore (linting) and a mypyScore (type checking). Are there any metrics, either existing or in the works, for assessing code logic/behaviour and consistency?
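To make the question concrete, here is a rough sketch of the kind of behavioural metric I have in mind, as opposed to the static linting and type-checking scores. It is purely illustrative: `EvalResult`, `behaviour_score`, and `resolved` are hypothetical names and not taken from the open-swe codebase, and the test counts are assumed to come from running a task's test suite after the agent's change.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Hypothetical per-task result; all field names are illustrative."""
    ruff_score: float   # static linting score (assumed to exist already)
    mypy_score: float   # static type-checking score (assumed to exist already)
    tests_passed: int   # tests passing after the agent's change
    tests_total: int    # tests in the task's suite


def behaviour_score(result: EvalResult) -> float:
    """Fraction of the task's tests that pass; 0.0 if there are no tests."""
    if result.tests_total == 0:
        return 0.0
    return result.tests_passed / result.tests_total


def resolved(result: EvalResult) -> bool:
    """Binary outcome: the task counts as solved only if every test passes."""
    return result.tests_total > 0 and result.tests_passed == result.tests_total


if __name__ == "__main__":
    example = EvalResult(ruff_score=1.0, mypy_score=0.9,
                         tests_passed=3, tests_total=4)
    print(behaviour_score(example), resolved(example))
```

A pass-rate or all-tests-pass measure like this would capture whether the generated code actually behaves correctly, which linting and type checking alone cannot tell you.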