
Commit 091b6e6

Colin Flaherty authored and committed
Complete implementation
1 parent 8597ca7 commit 091b6e6

38 files changed: +8644 −0 lines

.gitignore

Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg

# Virtual environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/

# Jupyter Notebook
.ipynb_checkpoints

# IDE specific files
.idea/
.vscode/
*.swp
*.swo
.DS_Store

.python-version

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
3.11

LICENSE

Lines changed: 12 additions & 0 deletions
@@ -19,3 +19,15 @@ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

---

Third-Party Notices:

This software includes portions derived from code originally developed by Anthropic,
licensed under the MIT License. The original code has been modified.

Original copyright:
© 2024 Anthropic, PBC

Original license: https://github.com/modelcontextprotocol/servers/blob/main/LICENSE

README.md

Lines changed: 235 additions & 0 deletions
@@ -0,0 +1,235 @@
# Augment SWE-bench Verified Agent

[SWE-bench Verified](https://www.swebench.com/) tests how well AI systems handle software engineering tasks pulled from actual GitHub issues in popular open-source projects. Some example problems can be found in OpenAI’s [original blog post on the benchmark](https://openai.com/index/introducing-swe-bench-verified/). Where most coding benchmarks focus on isolated Leetcode-style programming problems, SWE-bench involves codebase navigation, iterating against a suite of regression tests, and overall much more complexity.

To achieve a 65.4% success rate on our first-ever SWE-bench submission, we combined Claude Sonnet 3.7 as our core driver with OpenAI’s o1 as our ensembler. We deferred leveraging our own models in order to build a strong open-source baseline agent from off-the-shelf models.

Since Anthropic's models are currently state-of-the-art on code, we used Claude Sonnet 3.7 as our agent's core driver, and we forked our agent system architecture from [Anthropic's own blog post about SWE-bench](https://www.anthropic.com/news/claude-3-7-sonnet).
9+
## Features
10+
11+
- Small and simple coding agent implementation + SWE-bench docker harness that is super easy to run and build on top of.
12+
- Implementation of tools from our SWE-bench submission:
13+
- Bash command execution
14+
- File viewing and editing
15+
- Sequential thinking for complex problem-solving
16+
- Prompt template + system prompt from our SWE-bench submission.
17+
- Integration with Anthropic's Claude for core agent and OpenAI models for ensembling
18+
- Command approval management for safe execution
19+
- Majority vote ensembler for selecting the best solution from multiple candidates
20+
- Support for running agent in a Docker container
21+
- Support for running SWE-bench eval harness
22+
23+
## Installation

### Prerequisites

- [Docker](https://www.docker.com/) (We tested with `Docker version 26.1.3, build 26.1.3-0ubuntu1~22.04.1`.)
- Anthropic API key (for Claude models)
- OpenAI API key (for OpenAI models)

### Setup

1. Clone the repository:
   ```bash
   git clone https://github.com/augmentcode/augment-swebench-agent.git
   cd augment-swebench-agent
   ```

2. Install dependencies:
   ```bash
   ./setup.sh
   source .venv/bin/activate
   ```

3. Set your API keys:
   ```bash
   # For Anthropic Claude models
   export ANTHROPIC_API_KEY=your_anthropic_api_key_here

   # For OpenAI models
   export OPENAI_API_KEY=your_openai_api_key_here
   ```
## Ways to use this repo

- Interactive mode: Use `cli.py` to spin up an interactive agent for experimentation or as a personal coding assistant!
- SWE-bench mode: Use `run_agent_on_swebench_problem.py` to run the agent on SWE-bench problems. This is similar to the script we used to generate our SWE-bench submission.

More details on both below!
## Usage (interactive mode)

Run the CLI interface to interact with the agent directly. By default, the agent will run in the current directory.

```bash
python cli.py
```

This will start an interactive session where you can communicate with the agent.

### Command-line Options

- `--workspace`: Path to the workspace directory (default: current directory)
- `--problem-statement`: Provide a problem statement to make the agent non-interactive (default: None)
- `--needs-permission`: Whether to require permission before executing commands (default: False)
- `--use-container-workspace`: Path to the shared volume that is mounted into the Docker container. This must be set if you are using `--docker-container-id`. (default: None)
- `--docker-container-id`: ID of the Docker container to use. This must be set if you are using `--use-container-workspace`. (default: None)

Example:
```bash
python cli.py --workspace /path/to/project --problem-statement "Fix the login issue"
```

### Non-interactive Mode

You can run the agent in non-interactive mode by providing a problem statement:

```bash
python cli.py --problem-statement "Implement a feature to sort items by date"
```

### Using Docker

If you want to use a Docker container for the workspace, you need to specify both the path to the Docker container volume and the Docker container ID:

```bash
python cli.py --use-container-workspace --docker-container-id <container_id> --workspace /path/to/docker/volume
```
## Usage (SWE-bench mode)

### Quick Test Run

As a test run, run the following command. It will generate 2 candidate solutions for each of 5 problems and run the evaluation step for each candidate solution. Finally, it will print instructions for running the ensembler on the results.

```bash
python run_agent_on_swebench_problem.py --num-examples 5 --num-candidate-solutions 2
```

You can increase `--num-examples` and `--num-candidate-solutions` to run on more problems and generate more candidate solutions, but be aware that this will take longer and cost more money.

### Command-line Options

- `--num-examples`: Number of examples to run on (default: None, which runs on all examples)
- `--shard-ct`: Number of shards to split the work into (default: 1)
- `--shard-id`: Shard ID to run (0-indexed, default: 0)
- `--num-processes`: Number of processes to use for each example (default: 8)
- `--num-candidate-solutions`: Number of candidate solutions to generate for each example (default: 8)

### Running on more examples

There are 500 examples total in SWE-bench Verified. Running all of them can take a while, so this repository supports a few levels of parallelism:
- First, we suggest running 8 processes via the `--num-processes` flag. Beyond this, Docker hits issues.
- Second, we support breaking the dataset into shards via the `--shard-ct` and `--shard-id` flags. This makes it relatively easy to split the work across multiple machines, which circumvents the issues with scaling Docker beyond 8 processes.

In our experiments, it took a couple of hours to run the full evaluation for 1 candidate solution per problem. This was with 10 shards split across separate pods (managed by Kubernetes), each running 8 processes.

Keep in mind that you may hit rate limits from Anthropic when running 80 agents in parallel like we did. We have very high rate limits with Anthropic's API that you may not have, so you may need to run with a smaller `--shard-ct` and/or `--num-processes`.

Suppose you want to run with 10 shards and 8 processes per shard. That means running the following command 10 times, varying the `--shard-id` flag from 0 to 9, on 10 different machines:
```bash
python run_agent_on_swebench_problem.py --shard-ct 10 --shard-id <worker_index> > logs.out 2> logs.err
```
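To make the per-shard invocations concrete, here is a small, hypothetical launcher sketch. It only prints the 10 commands (one per shard) rather than executing them; in practice you would dispatch each printed line to a separate machine:

```shell
#!/bin/sh
# Print one run_agent_on_swebench_problem.py command per shard (0..9).
# These lines are meant to be dispatched to 10 machines, not run locally.
SHARD_CT=10
i=0
while [ "$i" -lt "$SHARD_CT" ]; do
  echo "python run_agent_on_swebench_problem.py --shard-ct ${SHARD_CT} --shard-id ${i}"
  i=$((i + 1))
done
```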
### Majority Vote Ensembler

The Majority Vote Ensembler is a tool that selects the best solution from multiple candidates using an LLM. It works by presenting multiple candidate solutions to a problem to OpenAI's o1 model and asking it to analyze them and select the most common solution.

#### How It Works

1. The tool takes a JSON file containing problems, each with multiple candidate solutions (diffs)
2. For each problem, it constructs a prompt using the `build_ensembler_prompt` function
3. The prompt is sent to o1
4. The LLM analyzes all candidate solutions and selects the best one
5. The tool extracts the selected solution index from the LLM's response
6. Results are saved to a JSON file
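In this repository the selection itself is delegated to the LLM, but the underlying majority-vote idea can be sketched without one. The following is a minimal, hypothetical Python illustration (not the repository's implementation) that picks the most frequent candidate diff after per-line whitespace normalization:

```python
from collections import Counter

def majority_vote(diffs: list[str]) -> int:
    """Return the index of the most frequent diff after per-line
    whitespace normalization; ties go to the earliest candidate
    (Counter preserves insertion order)."""
    normalized = [
        "\n".join(line.strip() for line in d.splitlines())
        for d in diffs
    ]
    counts = Counter(normalized)
    winner, _ = counts.most_common(1)[0]
    return normalized.index(winner)

# Two of the three candidates are identical up to trailing whitespace,
# so the first of that pair (index 0) is selected.
print(majority_vote(["+fix_a", "+fix_b", "+fix_a  "]))  # → 0
```

An LLM-based ensembler can do better than this literal count because it can recognize semantically equivalent diffs that differ textually, which is why the repository prompts o1 instead.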
#### Usage

```bash
python majority_vote_ensembler.py path/to/input.jsonl --output_path path/to/output.json --workers 8
```

Where:
- `path/to/input.jsonl` is a JSONL file containing problems and candidate solutions (see `example_ensembler_dataset.jsonl` for the format)
- `--output_path` specifies where to save the results
- `--workers` sets the number of worker threads for parallel processing (default: 8)

#### Example

```bash
python majority_vote_ensembler.py example_ensembler_data.jsonl --output_path example_ensembler_results.json
```
#### Input Format

The input JSONL file should contain one problem object per line, each with the following structure:

```json
{
  "id": "problem-1",
  "instruction": "Add a function to calculate factorial",
  "diffs": [
    "```diff\n@@ -10,3 +10,10 @@\n def function():\n return x\n+\n+def new_function():\n+ return y\n```",
    "...other candidate solutions..."
  ],
  "eval_outcomes": [
    {
      "is_success": true
    },
    {
      "is_success": false
    },
    {
      "is_success": true
    }
  ]
}
```
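If you are generating your own input file, a quick sanity check helps. This is a hedged sketch (the field names come from the format above; the loader function is illustrative, not part of the repository's API):

```python
import json

def load_problems(path: str) -> list[dict]:
    """Load ensembler input: one JSON problem object per line."""
    problems = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            obj = json.loads(line)
            # Validate the fields the ensembler relies on.
            assert {"id", "instruction", "diffs"} <= obj.keys()
            assert len(obj["diffs"]) >= 1
            problems.append(obj)
    return problems
```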
#### Output Format

The output JSON file will contain an array of result objects, each with the following structure:

```json
[
  {
    "id": "problem-1",
    "instruction": "Add a function to calculate factorial",
    "response": "[LLM's full response text]",
    "selected_diff_index": 2,
    "selected_diff": "[The selected diff content]",
    "is_eval_success": true
  }
]
```
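One common post-processing step is scoring the ensembler's picks. As a small, hypothetical sketch (the `is_eval_success` field follows the output format above; the helper itself is illustrative):

```python
import json

def ensembler_success_rate(results: list[dict]) -> float:
    """Fraction of problems where the selected diff passed evaluation."""
    if not results:
        return 0.0
    return sum(1 for r in results if r.get("is_eval_success")) / len(results)

# Typical use: load the ensembler output, then score it.
# with open("example_ensembler_results.json") as f:
#     print(ensembler_success_rate(json.load(f)))
```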
## Development

### Running Tests

```bash
pytest
```

### Adding New Tools

To add a new tool to the agent:

1. Create a new tool class in the `tools/` directory
2. Implement the required methods (`run_impl`, `get_tool_param`, etc.)
3. Add the tool to the agent's tools list in `tools/agent.py`
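The steps above can be sketched as follows. Note this is an illustrative sketch only: the method names `run_impl` and `get_tool_param` come from the list above, but the exact base class and signatures live in `tools/`, and the schema shape shown here is an assumption:

```python
from dataclasses import dataclass

@dataclass
class EchoTool:
    """Illustrative tool: echoes the provided text back to the agent."""
    name: str = "echo"

    def get_tool_param(self) -> dict:
        # Tool description advertised to the model (shape is illustrative).
        return {
            "name": self.name,
            "description": "Echo the provided text back verbatim.",
            "input_schema": {
                "type": "object",
                "properties": {"text": {"type": "string"}},
                "required": ["text"],
            },
        }

    def run_impl(self, tool_input: dict) -> str:
        # Called by the agent loop when the model invokes the tool.
        return tool_input["text"]
```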
226+
### Customizing the Majority Vote Ensembler
227+
228+
You can customize the Majority Vote Ensembler by modifying:
229+
230+
- `prompts/ensembler_prompt.py`: Change the prompt template used for ensembling
231+
- Change the LLM model by modifying the `get_client` call in `process_problem` function
232+
233+
## License
234+
235+
This project is licensed under the MIT License.

__init__.py

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
"""Root package."""
