# Augment SWE-bench Verified Agent

[SWE-bench Verified](https://www.swebench.com/) tests how well AI systems handle software engineering tasks pulled from actual GitHub issues in popular open-source projects. Some example problems can be found in OpenAI’s [original blog post on the benchmark](https://openai.com/index/introducing-swe-bench-verified/). Where most coding benchmarks focus on isolated Leetcode-style programming problems, SWE-bench involves codebase navigation, iterating against a suite of regression tests, and overall much more complexity.

To achieve a 65.4% success rate on our first-ever SWE-bench submission, we combined Claude Sonnet 3.7 as our core driver with OpenAI’s o1 as our ensembler. We deferred leveraging our own models in order to build a strong open-source baseline agent with off-the-shelf models.

Since Anthropic's models are currently state-of-the-art on code, we used Claude Sonnet 3.7 as our agent's core driver, and we forked our agent system architecture from [Anthropic's own blog post about SWE-bench](https://www.anthropic.com/news/claude-3-7-sonnet).

## Features

- Small and simple coding agent implementation + SWE-bench Docker harness that is super easy to run and build on top of.
- Implementation of tools from our SWE-bench submission:
  - Bash command execution
  - File viewing and editing
  - Sequential thinking for complex problem-solving
- Prompt template + system prompt from our SWE-bench submission.
- Integration with Anthropic's Claude for the core agent and OpenAI models for ensembling
- Command approval management for safe execution
- Majority vote ensembler for selecting the best solution from multiple candidates
- Support for running the agent in a Docker container
- Support for running the SWE-bench eval harness

## Installation

### Prerequisites

- [Docker](https://www.docker.com/) (We tested with `Docker version 26.1.3, build 26.1.3-0ubuntu1~22.04.1`.)
- Anthropic API key (for Claude models)
- OpenAI API key (for OpenAI models)

### Setup

1. Clone the repository:
   ```bash
   git clone https://github.com/augmentcode/augment-swebench-agent.git
   cd augment-swebench-agent
   ```

2. Install dependencies:
   ```bash
   ./setup.sh
   source .venv/bin/activate
   ```

3. Set your API keys:
   ```bash
   # For Anthropic Claude models
   export ANTHROPIC_API_KEY=your_anthropic_api_key_here

   # For OpenAI models
   export OPENAI_API_KEY=your_openai_api_key_here
   ```
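
   If you want to confirm that both keys are visible to the agent's Python process before running anything, a quick check like the one below works (just a convenience snippet, not part of the repository):
   ```python
   # Sanity check: make sure the API keys the agent needs are set in the environment.
   import os

   for key in ("ANTHROPIC_API_KEY", "OPENAI_API_KEY"):
       if not os.environ.get(key):
           raise SystemExit(f"{key} is not set; export it before running the agent.")
   print("Both API keys are set.")
   ```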

## Ways to use this repo

- Interactive mode: Use `cli.py` to spin up an interactive agent for experimentation or as a personal coding assistant!
- SWE-bench mode: Use `run_agent_on_swebench_problem.py` to run the agent on SWE-bench problems. This is similar to the script we used to generate our SWE-bench submission.

More details on both below!

## Usage (interactive mode)

Run the CLI interface to interact with the agent directly. By default, the agent will run
in the current directory.

```bash
python cli.py
```

This will start an interactive session where you can communicate with the agent.

### Command-line Options

- `--workspace`: Path to the workspace directory (default: current directory)
- `--problem-statement`: Provide a problem statement to make the agent non-interactive (default: None)
- `--needs-permission`: Whether to require permission before executing commands (default: False)
- `--use-container-workspace`: Path to the shared volume that is mounted into the Docker container. This must be set if you are using `--docker-container-id`. (default: None)
- `--docker-container-id`: ID of the Docker container to use. This must be set if you are using `--use-container-workspace`. (default: None)

Example:
```bash
python cli.py --workspace /path/to/project --problem-statement "Fix the login issue"
```

### Non-interactive Mode

You can run the agent in non-interactive mode by providing a problem statement:

```bash
python cli.py --problem-statement "Implement a feature to sort items by date"
```

### Using Docker

If you want to use a Docker container for the workspace, you need to specify the path to the Docker container
volume as well as the Docker container ID:

```bash
python cli.py --use-container-workspace /path/to/docker/volume --docker-container-id <container_id>
```

## Usage (SWE-bench mode)

### Quick Test Run

For a quick test, run the following command. It will generate 2 candidate solutions for each of 5 problems, run the evaluation step for each candidate solution, and finally provide instructions for running the ensembler on the results.

```bash
python run_agent_on_swebench_problem.py --num-examples 5 --num-candidate-solutions 2
```

You can increase `--num-examples` and `--num-candidate-solutions` to run on more problems and generate more candidate solutions, but be aware that this will take longer and cost more money.

### Command-line Options

- `--num-examples`: Number of examples to run on (default: None, which runs on all examples)
- `--shard-ct`: Number of shards to split the work into (default: 1)
- `--shard-id`: Shard ID to run (0-indexed, default: 0)
- `--num-processes`: Number of processes to use for each example (default: 8)
- `--num-candidate-solutions`: Number of candidate solutions to generate for each example (default: 8)

### Running on more examples

There are 500 examples total in SWE-bench Verified. Running all of them can take a while, so this repository supports a few levels of parallelism:
- Firstly, we suggest running 8 processes. This is the `--num-processes` flag. Beyond this, Docker hits issues.
- Secondly, we support breaking up the dataset into shards. These are the `--shard-ct` and `--shard-id` flags. Sharding makes it relatively easy to split the work across multiple machines, which circumvents the issues with scaling Docker beyond 8 processes (see the sketch after this list for how a shard maps onto the dataset).
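
As an illustration of the sharding idea, a shard can be thought of as every `--shard-ct`-th problem starting at `--shard-id`. This is a sketch of the concept only; the actual selection logic lives in `run_agent_on_swebench_problem.py` and may differ in detail.

```python
# Illustrative only: how a --shard-ct / --shard-id pair might select a slice
# of the 500 SWE-bench Verified problems.
problems = [f"problem-{i}" for i in range(500)]  # stand-in for the dataset

def shard(problems: list[str], shard_ct: int, shard_id: int) -> list[str]:
    """Return the subset of problems handled by shard `shard_id` out of `shard_ct`."""
    return [p for i, p in enumerate(problems) if i % shard_ct == shard_id]

print(len(shard(problems, shard_ct=10, shard_id=0)))  # 50 problems per shard
```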

In our experiments, it took us a couple of hours to run the full evaluation with 1 candidate solution per problem. This was
with 10 shards split across separate pods (managed by Kubernetes), each running 8 processes.

Keep in mind that you may hit Anthropic rate limits when running 80 agents in parallel like we did. We have very high rate limits on Anthropic's API that you may not have, so you may need to run with a smaller `--shard-ct` and/or `--num-processes`.

Suppose you want to run with 10 shards and 8 processes per shard. You would then run the following command 10 times, on 10 different machines, varying the `--shard-id` flag from 0 to 9:
```bash
python run_agent_on_swebench_problem.py --shard-ct 10 --shard-id <worker_index> > logs.out 2> logs.err
```

### Majority Vote Ensembler

The Majority Vote Ensembler is a tool that helps select the best solution from multiple candidates using an LLM. It works by presenting multiple candidate solutions to a problem to OpenAI's o1 model and asking it to analyze and select the most common solution (see the sketch after the steps below).

#### How It Works

1. The tool takes a JSON file containing problems, each with multiple candidate solutions (diffs)
2. For each problem, it constructs a prompt using the `build_ensembler_prompt` function
3. The prompt is sent to o1
4. The LLM analyzes all candidate solutions and selects the best one
5. The tool extracts the selected solution index from the LLM's response
6. Results are saved to a JSON file
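
The sketch below illustrates that flow end to end. It is illustrative only: the real implementation is `majority_vote_ensembler.py`, the prompt builder here is a simplified stand-in for `build_ensembler_prompt` (whose real template lives in `prompts/ensembler_prompt.py`), and the way the selected index is extracted from the response is an assumption.

```python
# Illustrative sketch of the ensembling flow, not the repository's exact code.
import re
from openai import OpenAI

def build_prompt(instruction: str, diffs: list[str]) -> str:
    # Stand-in for build_ensembler_prompt: show every candidate and ask for the majority pick.
    numbered = "\n\n".join(f"Candidate {i}:\n{d}" for i, d in enumerate(diffs))
    return (
        f"Problem:\n{instruction}\n\n{numbered}\n\n"
        "Select the candidate that most of the solutions agree on. Reply with its index."
    )

def ensemble_problem(problem: dict, client: OpenAI) -> dict:
    # Steps 1-3: build one prompt containing all candidate diffs and send it to o1.
    prompt = build_prompt(problem["instruction"], problem["diffs"])
    response = client.chat.completions.create(
        model="o1",
        messages=[{"role": "user", "content": prompt}],
    )
    text = response.choices[0].message.content
    # Steps 4-5: pull the selected index out of the model's answer (answer format assumed here).
    match = re.search(r"\d+", text)
    index = int(match.group()) if match else 0
    # Step 6: return a result record that the caller can dump to JSON.
    return {
        "id": problem["id"],
        "instruction": problem["instruction"],
        "response": text,
        "selected_diff_index": index,
        "selected_diff": problem["diffs"][index],
    }
```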

#### Usage

```bash
python majority_vote_ensembler.py path/to/input.jsonl --output_path path/to/output.json --workers 8
```

Where:
- `path/to/input.jsonl` is a JSONL file containing problems and candidate solutions (see `example_ensembler_dataset.jsonl` for format)
- `--output_path` specifies where to save the results
- `--workers` sets the number of worker threads for parallel processing (default: 8)

#### Example

```bash
python majority_vote_ensembler.py example_ensembler_data.jsonl --output_path example_ensembler_results.json
```

#### Input Format

The input JSONL file should contain one problem object per line, each with the following structure:

```json
{
  "id": "problem-1",
  "instruction": "Add a function to calculate factorial",
  "diffs": [
    "```diff\n@@ -10,3 +10,10 @@\n def function():\n     return x\n+\n+def new_function():\n+    return y\n```",
    "...other candidate solutions..."
  ],
  "eval_outcomes": [
    {
      "is_success": true
    },
    {
      "is_success": false
    },
    {
      "is_success": true
    }
  ]
}
```
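
If you are assembling this file yourself (for example, from candidate diffs produced by your own agent runs), a minimal way to write it, assuming the field names shown above, looks like this:

```python
# Illustrative only: writing an ensembler input JSONL file with the fields shown above.
# The diff strings and output file name here are placeholders.
import json

problems = [
    {
        "id": "problem-1",
        "instruction": "Add a function to calculate factorial",
        "diffs": ["<unified diff for candidate 1>", "<unified diff for candidate 2>"],
        "eval_outcomes": [{"is_success": True}, {"is_success": False}],
    },
]

with open("my_ensembler_dataset.jsonl", "w") as f:
    for problem in problems:
        f.write(json.dumps(problem) + "\n")
```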

#### Output Format

The output JSON file will contain an array of result objects, each with the following structure:

```json
[
  {
    "id": "problem-1",
    "instruction": "Add a function to calculate factorial",
    "response": "[LLM's full response text]",
    "selected_diff_index": 2,
    "selected_diff": "[The selected diff content]",
    "is_eval_success": true
  }
]
```
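
Since each result carries the `is_eval_success` flag for the selected diff, you can summarize how often the ensembled choice passed evaluation with a few lines like the following (the file name assumes the example command above):

```python
# Illustrative only: summarize ensembler results using the is_eval_success field.
import json

with open("example_ensembler_results.json") as f:
    results = json.load(f)

solved = sum(1 for r in results if r.get("is_eval_success"))
print(f"Selected solution passed evaluation on {solved}/{len(results)} problems")
```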

## Development

### Running Tests

```bash
pytest
```

### Adding New Tools

To add a new tool to the agent:

1. Create a new tool class in the `tools/` directory
2. Implement the required methods (`run_impl`, `get_tool_param`, etc.)
3. Add the tool to the agent's tools list in `tools/agent.py` (see the sketch after this list)
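
Below is a hypothetical skeleton of what such a tool might look like. The class shape, method signatures, and return types are assumptions for illustration; use an existing tool in `tools/` as the authoritative reference for the exact interface.

```python
# Hypothetical example tool; the method signatures below are assumptions,
# so mirror an existing class in tools/ when writing a real one.
from pathlib import Path

class WordCountTool:
    """Counts the words in a file inside the workspace."""

    name = "word_count"
    description = "Count the number of words in a file."

    def get_tool_param(self) -> dict:
        # Describe the tool and its input schema in the shape the LLM API expects.
        return {
            "name": self.name,
            "description": self.description,
            "input_schema": {
                "type": "object",
                "properties": {
                    "path": {"type": "string", "description": "File to count words in."}
                },
                "required": ["path"],
            },
        }

    def run_impl(self, tool_input: dict) -> str:
        # Execute the tool and return a plain-text result for the agent.
        text = Path(tool_input["path"]).read_text()
        return f"{len(text.split())} words"
```

Once the class exists, register an instance of it in the agent's tools list in `tools/agent.py` so the model can call it.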

### Customizing the Majority Vote Ensembler

You can customize the Majority Vote Ensembler by modifying:

- `prompts/ensembler_prompt.py`: Change the prompt template used for ensembling
- The `get_client` call in the `process_problem` function: Change the LLM model used for ensembling

## License

This project is licensed under the MIT License.