```
python -m src.main \
    --run_id <run_id>
# use --predictions_path 'gold' to verify the gold patches
# use --run_id to name the evaluation run
# use --exec_mode reproduction_script --reproduction_script_name <script_name> to run in reproduction script mode (see below)
```
This command will generate docker build logs (`image_build_logs`) and evaluation logs (`run_instance_swt_logs`) in the current directory.
The final evaluation results will be stored in the `evaluation_results` directory.

### Unit Test mode vs. Reproduction Script mode

By default, SWT-Bench operates in unit test mode, where model predictions are treated as unit tests to be integrated into the existing test suite. The evaluation harness runs the modified parts of the test suite and reports the changes in outcomes to compute the success rate. A successful prediction adds a fail-to-pass test, i.e. one that fails on the buggy codebase and passes once the golden patch is applied, without causing existing tests to fail.
In the simpler reproduction script mode, model predictions are treated as standalone scripts that reproduce the reported issue. The evaluation harness runs the script against the codebase and determines success from the script's exit code: 0 for pass, 1 for fail. The test suite is not executed in this mode.
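
A reproduction script for this mode might be sketched as follows; the check itself is a hypothetical placeholder, and only the exit-code convention comes from the harness description above:

```python
import sys

def check_passes():
    # Hypothetical assertion standing in for the behavior reported in the issue.
    return "hello world".replace(" ", "-") == "hello-world"

if __name__ == "__main__":
    # The harness reads only the exit code: 0 for pass, 1 for fail.
    sys.exit(0 if check_passes() else 1)
```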
## Reporting results
To assess the result of a single evaluation run, we provide a simple script.

For our evaluation of OpenHands, we automatically discard all top-level files to …
Moreover, for the evaluation of the agent in the correct environment, we discard changes to `setup.py`, `pyproject.toml` and `requirements.txt` files, as they are changed by the test setup and conflict with the repeated evaluation.
To find the exact setup used for OpenHands, check out the branch [`feat/CI`](https://github.com/logic-star-ai/swt-bench/tree/feat/CI).
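
The discarding step described above could be sketched like this (illustrative only; the actual filtering lives in the `feat/CI` branch and may differ):

```python
# Drop sections of a unified diff that touch files rewritten by the test
# setup, so they do not conflict with repeated evaluation.
IGNORED = {"setup.py", "pyproject.toml", "requirements.txt"}

def filter_patch(patch: str) -> str:
    kept, keep = [], True
    for line in patch.splitlines(keepends=True):
        if line.startswith("diff --git "):
            # e.g. "diff --git a/setup.py b/setup.py"
            target = line.split()[-1][2:]          # strip the "b/" prefix
            keep = target.split("/")[-1] not in IGNORED
        if keep:
            kept.append(line)
    return "".join(kept)
```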
AEGIS was evaluated in reproduction script mode.
## 🏗 Building SWT-Bench and Zero-Shot inference
To recreate the SWT-Bench dataset or create one with your own flavoring

<p class="is-size-7">The results reported here are evaluation results on SWT-Bench Lite and Verified. We have independently executed submitted predictions for verification. <sup>‡</sup> Generates stand-alone reproduction scripts and does not attempt integration into the test framework. <sup>#</sup> Leverages execution feedback from a correctly set-up <a title="Continuous Integration" href="https://en.wikipedia.org/wiki/Continuous_integration">CI</a> environment.</p>