Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
Evaluate AI agents on realistic software evolution • Multi-step planning and adaptation • Long-horizon reasoning challenges
Introduction • Quick Start • How It Works • Evaluation • Acknowledgements
SWE-EVO is a benchmark designed to evaluate AI coding agents in autonomous software evolution tasks. Unlike benchmarks that focus on isolated coding problems, SWE-EVO simulates realistic scenarios where agents must iteratively evolve complex codebases according to high-level software requirement specifications (SRS).
Using versioned histories from real Python open-source projects (such as Django and NumPy), SWE-EVO challenges agents to:
- Interpret high-level software requirement specifications
- Plan and implement multi-step changes
- Navigate large-scale repositories with thousands of files
- Produce correct changes across multiple versions
Given an existing codebase and evolving requirements, can AI agents autonomously perform sustained planning, adaptation, and evolution over long interactions?
| Feature | Description |
|---|---|
| Realistic Tasks | Derived from authentic project evolution histories, emphasizing change over time |
| Multi-Step Evaluation | Agents must plan, update, and validate changes across versions |
| Modular Scaffolds | Supports evaluation via OpenHands and SWE-agent |
| Public Dataset | Curated instances with tools for reproducible evaluation |
| Long-Horizon Focus | Challenges AI systems with iterative evolution and sustained reasoning |
```bash
git clone https://github.com/FSoft-AI4Code/SWE-EVO.git
cd SWE-EVO
pip install -e .
```

To evaluate a set of agent trajectories:

```bash
python SWE-bench/evaluate_instance.py \
--trajectories_path <path-to-your-trajectories> \
--max_workers <num_workers> \
--scaffold <scaffold_name>
```

Conceptual model of software evolution in SWE-EVO: from base system to evolved system through requirement interpretation and change execution.
```text
┌──────────────────┐
│  Base Codebase   │  Initial state of the repository
└────────┬─────────┘
         │
         ↓
┌──────────────────┐
│   SRS Document   │  High-level requirements specification
└────────┬─────────┘
         │
         ↓
┌──────────────────┐
│     AI Agent     │  Plans and implements changes
└────────┬─────────┘
         │
         ↓
┌──────────────────┐
│ Evolved Codebase │  Updated repository matching requirements
└──────────────────┘
```
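The released instances live under `hf_out/`, the paths referenced in the commands below. The following is a minimal sketch for inspecting them, assuming `hf_out/hf_dataset` is a HuggingFace `datasets` on-disk directory (as suggested by the `--instances.path_override` value used later); the path and the printed fields are illustrative, not guaranteed schema:

```python
# Minimal sketch: inspect the SWE-EVO instances, assuming hf_out/hf_dataset
# is a HuggingFace `datasets` save_to_disk directory.
from datasets import Dataset, load_from_disk

dataset = load_from_disk("SWE-EVO/hf_out/hf_dataset")  # adjust to your checkout path

# load_from_disk may return a single Dataset or a DatasetDict of splits.
splits = {"default": dataset} if isinstance(dataset, Dataset) else dict(dataset)

for name, split in splits.items():
    # Print split sizes and column names without assuming a specific schema.
    print(f"{name}: {split.num_rows} instances, columns = {split.column_names}")
```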
To run inference with OpenHands, first change into the OpenHands directory:

```bash
cd OpenHands
```

Edit `OpenHands/config.toml` and add a new model block. You can leave `api_key = ""` and pass the real key through an environment variable (for example: `export OPENAI_API_KEY=...`).

Example:

```toml
[llm.your_model]
model = "your_model"
api_key = ""          # leave blank and export the key as an environment variable
base_url = "your_url"
temperature = 0.0
```

Use the OpenHands run_infer.sh script:
```bash
./evaluation/benchmarks/swe_bench/scripts/run_infer.sh \
[model_config] \
[git_version] \
[agent] \
[eval_limit] \
[num_workers] \
[dataset_path] \
[dataset_split] \
[n_runs] \
[mode]
```

Example:
```bash
./evaluation/benchmarks/swe_bench/scripts/run_infer.sh \
llm.your_model \
HEAD \
CodeActAgent \
48 \
3 \
your_project_path/SWE-EVO/hf_out/hf_jsonl \
test \
1 \
swe
```

Notes:
- `model_config` refers to the config block name you added (for example, `llm.your_model`)
- For more information, see the OpenHands SWE-Bench instructions
After inference finishes, evaluate the generated trajectories:
```bash
python SWE-bench/evaluate_instance.py \
--trajectories_path /path/to/openhands/outputs \
--max_workers 8 \
--scaffold OpenHands
```

To run inference with SWE-agent instead, use `sweagent run-batch`:

```bash
cd SWE-agent
sweagent run-batch \
--config config/default.yaml \
--agent.model.name [YOUR_MODEL] \
--agent.model.api_key [YOUR_API_KEY] \
--agent.model.api_base [YOUR_API_BASE] \
--agent.model.reasoning_effort "[low|medium|high]" \
--instances.type swe_bench \
--instances.path_override "your_project_path/SWE-EVO/hf_out/hf_dataset" \
--instances.split [dataset_split] \
--instances.slice :1000 \
--num_workers [num_workers] \
--output_dir [output_dir]
```

Example:
```bash
MODEL="gpt-5-2025-08-07"
sweagent run-batch \
--config config/default.yaml \
--agent.model.name "$MODEL" \
--agent.model.api_key "$OPENAI_API_KEY" \
--agent.model.api_base "https://api.openai.com/v1" \
--agent.model.reasoning_effort "medium" \
--instances.type swe_bench \
--instances.path_override "your_project_path/SWE-EVO/hf_out/hf_dataset" \
--instances.split "test" \
--instances.slice ":1000" \
--num_workers 4 \
--output_dir "trajectories/$MODEL"
```

Notes: Please refer to the SWE-agent documentation for additional configuration details and advanced usage.
After inference finishes, evaluate the generated trajectories:
```bash
python SWE-bench/evaluate_instance.py \
--trajectories_path /path/to/sweagent/outputs \
--max_workers 8 \
--scaffold SWE-agent
```

| Parameter | Description |
|---|---|
| `--trajectories_path` | Path to your agent trajectory outputs |
| `--max_workers` | Number of parallel workers for evaluation |
| `--scaffold` | Scaffold name (`OpenHands` or `SWE-agent`) |
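If you evaluate several scaffolds or models, the evaluation step can be scripted. Below is a hedged sketch that simply reuses the `evaluate_instance.py` flags documented above; the trajectory paths are placeholders:

```python
# Hedged convenience wrapper around the evaluation CLI documented above.
# Only the script name and flags come from this README; the paths are
# placeholders to replace with your own output directories.
import subprocess

runs = {
    "OpenHands": "/path/to/openhands/outputs",
    "SWE-agent": "/path/to/sweagent/outputs",
}

for scaffold, trajectories_path in runs.items():
    subprocess.run(
        [
            "python", "SWE-bench/evaluate_instance.py",
            "--trajectories_path", trajectories_path,
            "--max_workers", "8",
            "--scaffold", scaffold,
        ],
        check=True,  # stop if an evaluation run fails
    )
```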
Contributions are welcome! Please feel free to submit issues or pull requests.
SWE-EVO builds on the original SWE-bench benchmark. We are grateful to the SWE-bench team for their foundational work in software engineering evaluation.
Special thanks to:
- SWE-bench for pioneering software engineering benchmarks for AI
- OpenHands for their open-source AI agent framework
- SWE-agent for their agent scaffold and tooling
- The open-source community behind Django, NumPy, and other projects used in this benchmark
MIT License - See LICENSE for details.
```bibtex
@article{sweevo2024,
  title={SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios},
  author={...},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2024}
}
```
