Commit 5d6e00e

Add reproduction script mode (#20)
* Adapt params and execution specs to allow reproduction script mode
* Fix
* Add grading/log-parser for reproduction script
* Add documentation for reproduction script mode
* Fix imports
* Fix
* Fix
* Fix description

1 parent 5c85ecf commit 5d6e00e

File tree: 14 files changed (+121 / -79 lines)


README.md
Lines changed: 10 additions & 1 deletion

@@ -62,12 +62,19 @@ python -m src.main \
 --run_id <run_id>
 # use --predictions_path 'gold' to verify the gold patches
 # use --run_id to name the evaluation run
+# use --exec_mode reproduction_script --reproduction_script_name <script_name> to run in reproduction script mode (see below)
 ```
 
 This command will generate docker build logs (`image_build_logs`) and evaluation logs (`run_instance_swt_logs`) in the current directory.
-
 The final evaluation results will be stored in the `evaluation_results` directory.
 
+### Unit Test mode vs. Reproduction Script mode
+
+By default, SWT-Bench operates in unit test mode, where model predictions are treated as unit tests to be integrated into the existing test suite. The evaluation harness runs the modified parts of the test suite and reports changes to compute the success rate. Successful patches add a pass-to-fail test without causing existing tests to fail.
+
+In the simpler reproduction script mode, model predictions are considered standalone scripts that reproduce issues. The evaluation harness runs the script on the codebase and determines success based on the script's exit code: 0 for pass and 1 for fail. The test suite is not executed in this mode.
+
+
 ## Reporting results
 
 To assess the result of a single run, we provide a simple script to assess a single evaluation run.
@@ -137,6 +144,8 @@ For our evaluation of OpenHands, we automatically discard all top-level files to
 Moreover, for the evaluation of the agent in the correct environment, we discard changes to `setup.py`, `pyproject.toml` and `requirements.txt` files, as they are changed by the test setup and conflict with the repeated evaluation.
 To find the exact setup used for OpenHands, check out the branch [`feat/CI`](https://github.com/logic-star-ai/swt-bench/tree/feat/CI).
 
+AEGIS was evaluated in reproduction script mode.
+
 ## 🏗 Building SWT-Bench and Zero-Shot inference
 
 To recreate the SWT-Bench dataset or create one with your own flavoring
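The exit-code convention described in the new README section can be illustrated with a minimal, hypothetical reproduction script (the bug, function names, and file name are invented for illustration): it exits 1 while the issue still reproduces and 0 once it is fixed.

```python
import sys


def buggy_slugify(title: str) -> str:
    # Hypothetical bug under reproduction: trailing whitespace
    # is not stripped before slugifying, leaving a trailing dash.
    return title.lower().replace(" ", "-")


def main() -> int:
    result = buggy_slugify("Hello World ")
    if result.endswith("-"):
        print(f"Issue reproduced: unexpected trailing dash in {result!r}")
        return 1  # nonzero exit code -> graded as FAILED (bug still present)
    return 0      # zero exit code -> graded as PASSED (bug fixed)


if __name__ == "__main__":
    sys.exit(main())
```

The harness never inspects the script's output, only its exit status, which is why such a script needs no test-framework integration.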

docs/index.html
Lines changed: 2 additions & 2 deletions

@@ -207,7 +207,7 @@ <h2 class="title is-4 is-spaced">News</h2>
 <tbody>
 <tr>
 <td>🆕&nbsp;<a href="https://arxiv.org/pdf/2411.18015">AEGIS</a><sup>&Dagger;</sup></td>
-<td>47.8%</td>
+<td>46.4%</td>
 <td>26.0%</td>
 <td><time>2025-02-17</time></td>
 <td><a href="https://files.sri.inf.ethz.ch/swt-bench/aegis/">🔗</a></td>
@@ -356,7 +356,7 @@ <h2 class="title is-4 is-spaced">News</h2>
 </div>
 <div class="columns is-max-desktop">
 <div class="column is-centered">
-<p class="is-size-7">The results reported here are evaluation results on SWT-Bench Lite and Verified. We have independently executed submitted predictions for verification. <sup>&Dagger;</sup> Generates stand-alone reproduction scripts and does not attempt integration into the test framework. <sup>#</sup> This approach leverages execution feedback from a correctly set-up <a title="Continuous Integration" href="https://en.wikipedia.org/wiki/Continuous_integration">CI</a> environment. </p>
+<p class="is-size-7">The results reported here are evaluation results on SWT-Bench Lite and Verified. We have independently executed submitted predictions for verification. <sup>&Dagger;</sup> Generates stand-alone reproduction scripts and does not attempt integration into the test framework. <sup>#</sup> Leverages execution feedback from a correctly set-up <a title="Continuous Integration" href="https://en.wikipedia.org/wiki/Continuous_integration">CI</a> environment. </p>

src/__init__.py
Lines changed: 2 additions & 2 deletions

@@ -1,11 +1,12 @@
-__version__ = "2.0.2"
+__version__ = "1.2.0"
 
 from src.constants import (
     KEY_INSTANCE_ID,
     KEY_MODEL,
     KEY_PREDICTION,
     MAP_REPO_TO_TEST_FRAMEWORK,
     MAP_VERSION_TO_INSTALL,
+    ResolvedStatus
 )
 
 from src.docker_build import (
@@ -33,7 +34,6 @@
     get_eval_report,
     get_pred_report,
     get_resolution_success,
-    ResolvedStatus,
     TestStatus,
 )

src/dataset.py
Lines changed: 1 addition & 1 deletion

@@ -1,6 +1,6 @@
 import json
 import pathlib
-from typing import List, Tuple, Optional, Dict, cast
+from typing import Dict, cast
 import re
 from datasets import load_dataset, Dataset, load_from_disk

src/docker_build.py
Lines changed: 1 addition & 2 deletions

@@ -8,7 +8,6 @@
 from tqdm import tqdm
 from concurrent.futures import ThreadPoolExecutor, as_completed
 from pathlib import Path
-import os
 from docker.models.containers import Container
 
 from src.constants import (
@@ -18,7 +17,6 @@
     MAP_VERSION_TO_INSTALL,
 )
 from src.test_spec import (
-    get_test_specs_from_dataset,
     make_test_spec,
     TestSpec
 )
@@ -52,6 +50,7 @@ def __str__(self):
 )
 
 BuildMode = Literal["cli", "api"]
+ExecMode = Literal["unit_test", "reproduction_script"]
 
 def docker_build_cli(
     build_dir: Path,
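The `ExecMode` alias added here is a `typing.Literal`, which constrains accepted values at type-check time. A small sketch of how such an alias can also back a runtime guard (the `validate_exec_mode` helper is hypothetical, not part of the codebase):

```python
from typing import Literal, get_args

# same alias shape as in the commit
ExecMode = Literal["unit_test", "reproduction_script"]


def validate_exec_mode(value: str) -> str:
    # runtime guard mirroring what a static checker enforces for the Literal
    if value not in get_args(ExecMode):
        raise ValueError(f"invalid exec_mode: {value!r}")
    return value


print(validate_exec_mode("reproduction_script"))  # prints "reproduction_script"
```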

src/docker_utils.py
Lines changed: 0 additions & 4 deletions

@@ -1,8 +1,4 @@
 import pathlib
-import subprocess
-import tempfile
-from io import BytesIO
-from typing import Literal
 import base64
 
 import docker

src/dockerfiles.py
Lines changed: 0 additions & 2 deletions

@@ -1,5 +1,3 @@
-from functools import partial
-
 # IF you change the base image, you need to rebuild all images (run with --force_rebuild)
 _DOCKERFILE_BASE = r"""
 FROM --platform={platform} ubuntu:22.04

src/exec_spec.py
Lines changed: 31 additions & 7 deletions

@@ -4,7 +4,7 @@
 import re
 
 from dataclasses import dataclass, asdict
-from typing import Union, List, Optional
+from typing import Union, List, Optional, Literal
 
 from src.constants import (
     SWEbenchInstance,
@@ -25,6 +25,8 @@
 
 DIFF_MODIFIED_FILE_REGEX = r"--- a/(.*)"
 
+ExecMode = Literal["unit_test", "reproduction_script"]
+
 
 @dataclass
 class ExecSpec:
@@ -50,6 +52,8 @@ class ExecSpec:
     rm_image: bool = False
     force_rebuild: bool = False
 
+    exec_mode: ExecMode = "unit_test"
+    reproduction_script_name: Optional[str] = None
     compute_coverage: bool = False
 
     @property
@@ -67,6 +71,20 @@ def as_dict(self):
 
     @property
     def test_command(self):
+        trace_path = "/root/trace.py"
+        changed_files_pattern = "({})".format("|".join(re.escape(x) for x in self.coverage_files))
+        trace_pattern = f"python3 {trace_path} --count -C coverage.cover --include-pattern '/testbed/{changed_files_pattern}'"
+
+        if self.exec_mode == "reproduction_script":
+            reproduction_script_path = f"/testbed/{self.reproduction_script_name}"
+            # executes just the reproduction script to determine the exit status
+            test_command = f"python3 {reproduction_script_path}"
+            if not self.compute_coverage:
+                return test_command
+            # executes the coverage script first to compute coverage, then the reproduction script to determine the exit status
+            return f"{trace_pattern} {reproduction_script_path} && {test_command}"
+
+        # otherwise execute the test suite command
         test_command = " ".join(
             [
                 MAP_REPO_TO_TEST_FRAMEWORK[self.repo][self.version],
@@ -76,10 +94,6 @@ def test_command(self):
         if not self.compute_coverage:
             return test_command
 
-        trace_path = "/root/trace.py"
-        changed_files_pattern = "({})".format("|".join(re.escape(x) for x in self.coverage_files))
-        trace_pattern = f"python3 {trace_path} --count -C coverage.cover --include-pattern '/testbed/{changed_files_pattern}'"
-
         cleaned_test_cmd = test_command.replace("--tb=no", "")
 
         if re.findall(r"python(3?) -m", cleaned_test_cmd):
@@ -255,12 +269,16 @@ def eval_script_list(self):
 
         if "install" in install:
             eval_commands.append(install["install"])
+        if self.exec_mode == "reproduction_script":
+            exit_mode_command = ["echo $?"]
+        else:
+            exit_mode_command = []
 
         if self.compute_coverage:
            cat_coverage_commands = ["cat coverage.cover"]
         else:
            cat_coverage_commands = []
-        eval_commands += apply_patch_commands + [test_command] + cat_coverage_commands + reset_commands
+        eval_commands += apply_patch_commands + [test_command] + exit_mode_command + cat_coverage_commands + reset_commands
 
         return eval_commands
 
@@ -352,7 +370,11 @@ def get_exec_specs_from_dataset(dataset: Union[list[SWEbenchInstance], list[Exec
     return list(map(make_exec_spec, dataset))
 
 
-def make_exec_spec(instance: SWEbenchInstance) -> ExecSpec:
+def make_exec_spec(
+    instance: SWEbenchInstance,
+    exec_mode: ExecMode = "unit_test",
+    reproduction_script_name: Optional[str] = None,
+) -> ExecSpec:
     if isinstance(instance, ExecSpec):
         return instance
     instance_id = instance["instance_id"]
@@ -387,4 +409,6 @@ def make_exec_spec(instance: SWEbenchInstance) -> ExecSpec:
         test_directives=test_directives,
         patch_list=patch_list,
         coverage_files=changed_files,
+        exec_mode=exec_mode,
+        reproduction_script_name=reproduction_script_name,
     )
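The branching this commit adds to `test_command` can be sketched with a stripped-down stand-in for `ExecSpec` (fields are reduced to what the branch reads; the trace command, `MiniExecSpec`, and its defaults such as `reproduce_bug.py` are illustrative assumptions, not the real spec):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MiniExecSpec:
    # Reduced stand-in for ExecSpec: only the fields test_command consults.
    exec_mode: str = "unit_test"  # "unit_test" or "reproduction_script"
    reproduction_script_name: Optional[str] = None
    compute_coverage: bool = False
    suite_command: str = "pytest --no-header -rA"  # placeholder for the mapped framework command

    @property
    def test_command(self) -> str:
        trace = "python3 /root/trace.py --count -C coverage.cover"
        if self.exec_mode == "reproduction_script":
            script = f"/testbed/{self.reproduction_script_name}"
            run = f"python3 {script}"
            if not self.compute_coverage:
                return run
            # trace first for coverage, then run plainly so $? reflects the script
            return f"{trace} {script} && {run}"
        return self.suite_command


print(MiniExecSpec(exec_mode="reproduction_script",
                   reproduction_script_name="reproduce_bug.py").test_command)
# python3 /testbed/reproduce_bug.py
```

Per `eval_script_list` above, reproduction script mode additionally appends `echo $?` after the command, so the script's exit status lands in the evaluation log for the grader.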

src/grading.py
Lines changed: 22 additions & 8 deletions

@@ -4,9 +4,9 @@
 import json
 from unidiff import PatchSet
 
+from src.exec_spec import ExecMode
 from src.constants import (
     APPLY_PATCH_FAIL,
-    APPLY_PATCH_PASS,
     FAIL_TO_FAIL,
     FAIL_TO_PASS,
     PASS_TO_FAIL,
@@ -15,11 +15,9 @@
     RESET_FAILED,
     TESTS_ERROR,
     TESTS_TIMEOUT,
-    ResolvedStatus,
     TestStatus,
 )
-from src.test_spec import TestSpec
-from src.log_parsers import MAP_REPO_TO_PARSER
+from src.log_parsers import MAP_REPO_TO_PARSER, parse_log_reproduction_script
 from src.utils import get_log_dir, setup_logging
 
 # MARK: Utility functions
@@ -45,27 +43,35 @@ def test_failed(case: str, sm: dict[str, str]) -> bool:
     )
 
 
-def get_logs_eval(log_fp: str, repo: str) -> tuple[dict[str, str], bool]:
+def get_logs_eval(
+    log_fp: str,
+    repo: str,
+    exec_mode: ExecMode,
+) -> tuple[dict[str, str], bool]:
     """
     Retrieve evaluation results for a task instance from its corresponding log file
 
     Args:
         log_fp (str): path to log file
+        repo (str): repository name
+        exec_mode (ExecMode): execution mode
+        reproduction_script_name (str): name of reproduction script
     Returns:
         bool: whether the patch applied successfully
         dict: status map
 
     TODO(john-b-yang): Check this is working properly...
     """
     # Convert e.g. "logs/scikit-learn__scikit-learn-12421/test_output.txt" to "scikit-learn/scikit-learn"
-    log_parser = MAP_REPO_TO_PARSER[repo]
+    log_parser = MAP_REPO_TO_PARSER[repo] if exec_mode != "reproduction_script" else parse_log_reproduction_script
 
     if not Path(log_fp).exists():
         # likely due to a timeout
         return {}, False
     with open(log_fp) as f:
         raw_content = f.read()
     # remove installation logs
+    # NOTE: does not work when not computing coverage
     content = re.split(r"\n\+ python3 [^\n]*trace.py --count -C coverage.cover [^\n]*\n", raw_content, flags=re.MULTILINE)[1]
     # remove coverage dumps
     content = content.split("\n+ cat coverage.cover")[0]
@@ -383,7 +389,15 @@ def get_pred_report(
     return report_map
 
 
-def report_results(patch_id: str, run_id: str, golden_code_patch, output_paths: Optional[List[str]], instance_id: str, repo: str) -> dict[str, dict[str, bool]]:
+def report_results(
+    patch_id: str,
+    run_id: str,
+    golden_code_patch,
+    output_paths: Optional[List[str]],
+    instance_id: str,
+    repo: str,
+    exec_mode: ExecMode,
+) -> dict[str, dict[str, bool]]:
     log_dir = get_log_dir(run_id, patch_id, instance_id)
     logger, report_path = setup_logging(log_dir, instance_id)
 
@@ -395,7 +409,7 @@ def report_results(
     patch_applied = []
     if output_paths is not None:
         for output_path in output_paths:
-            test_result, patch_applied_ = get_logs_eval(output_path, repo)
+            test_result, patch_applied_ = get_logs_eval(output_path, repo, exec_mode)
             patch_applied.append(patch_applied_)
             coverage_result = get_coverage_eval(output_path)
             test_results.append(test_result)
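The parser selection that `get_logs_eval` now performs can be sketched in isolation (both parsers here are hypothetical stubs; the real table maps every benchmark repo to its framework-specific parser):

```python
from typing import Callable


def parse_log_pytest_stub(log: str) -> dict[str, str]:
    # hypothetical stand-in for a per-repo unit-test log parser
    return {}


def parse_log_reproduction_script_stub(log: str) -> dict[str, str]:
    # hypothetical stand-in for the reproduction script parser
    return {}


MAP_REPO_TO_PARSER: dict[str, Callable[[str], dict[str, str]]] = {
    "scikit-learn/scikit-learn": parse_log_pytest_stub,
}


def pick_parser(repo: str, exec_mode: str) -> Callable[[str], dict[str, str]]:
    # mirrors get_logs_eval: reproduction script mode bypasses the per-repo table
    return MAP_REPO_TO_PARSER[repo] if exec_mode != "reproduction_script" else parse_log_reproduction_script_stub
```

The key design point is that reproduction script mode is repo-agnostic: the per-repo parser table is consulted only in unit test mode.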

src/log_parsers.py
Lines changed: 11 additions & 0 deletions

@@ -205,6 +205,17 @@ def parse_log_matplotlib(log: str) -> dict[str, str]:
         test_status_map[test_case[1]] = test_case[0]
     return test_status_map
 
+def parse_log_reproduction_script(log: str) -> dict[str, str]:
+    """
+    If there is a nonzero exit code, log a "main" test case with status "FAILED"
+    """
+    exit_code = re.findall(r"^\+ echo (\d+)$", log, re.MULTILINE)
+    if not exit_code:
+        return {}
+    name = "reproduction_script"
+    status = TestStatus.PASSED.value if exit_code[0] == "0" else TestStatus.FAILED.value
+    return {name: status}
+
 
 parse_log_astroid = parse_log_pytest
 parse_log_flask = parse_log_pytest
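The new parser keys off the xtrace line (`+ echo <code>`) that the `echo $?` appended by the eval script produces in the log. A self-contained version, with a stub `TestStatus` enum standing in for `src.constants.TestStatus` and an invented sample log:

```python
import re
from enum import Enum


class TestStatus(Enum):
    # stub for src.constants.TestStatus
    PASSED = "PASSED"
    FAILED = "FAILED"


def parse_log_reproduction_script(log: str) -> dict[str, str]:
    # the eval script runs `echo $?` after the reproduction script,
    # which shell tracing records as e.g. "+ echo 0"
    exit_code = re.findall(r"^\+ echo (\d+)$", log, re.MULTILINE)
    if not exit_code:
        return {}
    status = TestStatus.PASSED.value if exit_code[0] == "0" else TestStatus.FAILED.value
    return {"reproduction_script": status}


sample = "+ python3 /testbed/reproduce_bug.py\nAssertionError: bug present\n+ echo 1\n1"
print(parse_log_reproduction_script(sample))  # {'reproduction_script': 'FAILED'}
```

An empty result (no `+ echo` line found, e.g. after a timeout) leaves the status map empty rather than guessing a verdict.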
