How is open-swe evaluated, and what is the current state of evaluation? #790

arthurthlee · 2025-08-21T22:01:09Z

Just wondering, which metrics are used to evaluate open-swe? I can't find open-swe's results on any leaderboard such as SWE-bench, and I can't find much documentation on evaluation in the repo.

I saw that a recent PR states:


> This PR sets up the building blocks for the internal benchmark for open-swe.
>
> - Fetches langgraph PRs with over 25 characters in the PR body, and that modify a test file
> - Creates an initial script that reads in this processed json stored in langgraph_prs.json, spins up a daytona session, checks out the branch at the time of the commit, and then reverts to before that merge commit was created.
> - Moves the setupEnv function to utils in order to reuse it inside of the evals, and the internal benchmark.
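
For illustration, the PR-selection rule in the first point of the quote could look roughly like the sketch below. It is only a sketch under assumptions: the `PullRequest` record shape, the `body` and `files` field names, and the test-file heuristic are hypothetical and are not taken from the open-swe code or from the actual `langgraph_prs.json` schema.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath

# Threshold taken from the quoted PR description ("over 25 characters").
MIN_BODY_CHARS = 25


@dataclass
class PullRequest:
    # Hypothetical record shape; the real JSON schema is not shown in the post.
    number: int
    body: str
    files: list[str] = field(default_factory=list)


def is_test_file(path: str) -> bool:
    # Assumed heuristic: the file lives under a "test"/"tests" directory,
    # or its name follows the pytest patterns test_*.py / *_test.py.
    p = PurePosixPath(path)
    in_test_dir = any(part in ("test", "tests") for part in p.parts[:-1])
    pytest_name = p.name.endswith(".py") and (
        p.name.startswith("test_") or p.name.endswith("_test.py")
    )
    return in_test_dir or pytest_name


def keep_pr(pr: PullRequest) -> bool:
    # Keep PRs whose body is longer than the threshold AND that touch a test file.
    return len(pr.body or "") > MIN_BODY_CHARS and any(
        is_test_file(f) for f in pr.files
    )


def filter_prs(prs: list[PullRequest]) -> list[PullRequest]:
    return [pr for pr in prs if keep_pr(pr)]
```

A filter like this would only select candidate PRs; it says nothing about how the agent's output on those PRs is scored.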

In the code, I can see that open-swe is scored on a ruffScore (linting) and a mypyScore (type checking). Are there any metrics, either available now or in the works, for evaluating code logic/behaviour and consistency?

Replies: 0 comments
