Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
Evaluate AI agents on realistic software evolution • Multi-step planning and adaptation • Long-horizon reasoning challenges
Introduction • Quick Start • How It Works • Evaluation • Acknowledgements
SWE-EVO is a benchmark designed to evaluate AI coding agents in autonomous software evolution tasks. Unlike benchmarks that focus on isolated coding problems, SWE-EVO simulates realistic scenarios where agents must iteratively evolve complex codebases according to high-level software requirement specifications (SRS).
Using versioned histories from real Python open-source projects (such as Django and NumPy), SWE-EVO challenges agents to:
- Interpret high-level software requirement specifications
- Plan and implement multi-step changes
- Navigate large-scale repositories with thousands of files
- Produce correct changes across multiple versions
Given an existing codebase and evolving requirements, can AI agents autonomously perform sustained planning, adaptation, and evolution over long interactions?
| Feature | Description |
|---|---|
| Realistic Tasks | Derived from authentic project evolution histories, emphasizing change over time |
| Multi-Step Evaluation | Agents must plan, update, and validate changes across versions |
| Modular Scaffolds | Supports evaluation via OpenHands and SWE-agent |
| Public Dataset | Curated instances with tools for reproducible evaluation |
| Long-Horizon Focus | Challenges AI systems with iterative evolution and sustained reasoning |
```bash
git clone https://github.com/FSoft-AI4Code/SWE-EVO.git
cd SWE-EVO
pip install -e .
```

To evaluate a set of agent trajectories:

```bash
python SWE-bench/evaluate_instance.py \
--trajectories_path <path-to-your-trajectories> \
--max_workers <num_workers> \
--scaffold <scaffold_name>
```

Conceptual model of software evolution in SWE-EVO: from base system to evolved system through requirement interpretation and change execution.
```text
┌──────────────────┐
│  Base Codebase   │  Initial state of the repository
└────────┬─────────┘
         │
         ↓
┌──────────────────┐
│   SRS Document   │  High-level requirements specification
└────────┬─────────┘
         │
         ↓
┌──────────────────┐
│     AI Agent     │  Plans and implements changes
└────────┬─────────┘
         │
         ↓
┌──────────────────┐
│ Evolved Codebase │  Updated repository matching requirements
└──────────────────┘
```
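The released instances live under `hf_out/`, the paths referenced in the commands below. The following is a minimal sketch for inspecting them, assuming `hf_out/hf_dataset` is a HuggingFace `datasets` on-disk directory (as suggested by the `--instances.path_override` value used later); the path and the printed fields are illustrative, not guaranteed schema:

```python
# Minimal sketch: inspect the SWE-EVO instances, assuming hf_out/hf_dataset
# is a HuggingFace `datasets` save_to_disk directory.
from datasets import Dataset, load_from_disk

dataset = load_from_disk("SWE-EVO/hf_out/hf_dataset")  # adjust to your checkout path

# load_from_disk may return a single Dataset or a DatasetDict of splits.
splits = {"default": dataset} if isinstance(dataset, Dataset) else dict(dataset)

for name, split in splits.items():
    # Print split sizes and column names without assuming a specific schema.
    print(f"{name}: {split.num_rows} instances, columns = {split.column_names}")
```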
To run inference with OpenHands, first change into the OpenHands directory:

```bash
cd OpenHands
```

Edit `OpenHands/config.toml` and add a new model block. You can leave `api_key = ""` and pass the real key through an environment variable (for example: `export OPENAI_API_KEY=...`).

Example:

```toml
[llm.your_model]
model = "your_model"
api_key = ""          # leave blank and export the key as an environment variable
base_url = "your_url"
temperature = 0.0
```

Use the OpenHands run_infer.sh script:
```bash
./evaluation/benchmarks/swe_bench/scripts/run_infer.sh \
[model_config] \
[git_version] \
[agent] \
[eval_limit] \
[num_workers] \
[dataset_path] \
[dataset_split] \
[n_runs] \
[mode]
```

Example:
```bash
./evaluation/benchmarks/swe_bench/scripts/run_infer.sh \
llm.your_model \
HEAD \
CodeActAgent \
48 \
3 \
your_project_path/SWE-EVO/hf_out/hf_jsonl \
test \
1 \
swe
```

Notes:
- `model_config` refers to the config block name you added (for example, `llm.your_model`)
- For more information, see the OpenHands SWE-Bench instructions
After inference finishes, evaluate the generated trajectories:
```bash
python SWE-bench/evaluate_instance.py \
--trajectories_path /path/to/openhands/outputs \
--max_workers 8 \
--scaffold OpenHands
```

To run inference with SWE-agent instead, use `sweagent run-batch`:

```bash
cd SWE-agent
sweagent run-batch \
--config config/default.yaml \
--agent.model.name [YOUR_MODEL] \
--agent.model.api_key [YOUR_API_KEY] \
--agent.model.api_base [YOUR_API_BASE] \
--agent.model.reasoning_effort "[low|medium|high]" \
--instances.type swe_bench \
--instances.path_override "your_project_path/SWE-EVO/hf_out/hf_dataset" \
--instances.split [dataset_split] \
--instances.slice :1000 \
--num_workers [num_workers] \
--output_dir [output_dir]
```

Example:
```bash
MODEL="gpt-5-2025-08-07"
sweagent run-batch \
--config config/default.yaml \
--agent.model.name "$MODEL" \
--agent.model.api_key "$OPENAI_API_KEY" \
--agent.model.api_base "https://api.openai.com/v1" \
--agent.model.reasoning_effort "medium" \
--instances.type swe_bench \
--instances.path_override "your_project_path/SWE-EVO/hf_out/hf_dataset" \
--instances.split "test" \
--instances.slice ":1000" \
--num_workers 4 \
--output_dir "trajectories/$MODEL"
```

Notes: Please refer to the SWE-agent documentation for additional configuration details and advanced usage.
After inference finishes, evaluate the generated trajectories:
```bash
python SWE-bench/evaluate_instance.py \
--trajectories_path /path/to/sweagent/outputs \
--max_workers 8 \
--scaffold SWE-agent
```

| Parameter | Description |
|---|---|
| `--trajectories_path` | Path to your agent trajectory outputs |
| `--max_workers` | Number of parallel workers for evaluation |
| `--scaffold` | Scaffold name (`OpenHands` or `SWE-agent`) |
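If you evaluate several scaffolds or models, the evaluation step can be scripted. Below is a hedged sketch that simply reuses the `evaluate_instance.py` flags documented above; the trajectory paths are placeholders:

```python
# Hedged convenience wrapper around the evaluation CLI documented above.
# Only the script name and flags come from this README; the paths are
# placeholders to replace with your own output directories.
import subprocess

runs = {
    "OpenHands": "/path/to/openhands/outputs",
    "SWE-agent": "/path/to/sweagent/outputs",
}

for scaffold, trajectories_path in runs.items():
    subprocess.run(
        [
            "python", "SWE-bench/evaluate_instance.py",
            "--trajectories_path", trajectories_path,
            "--max_workers", "8",
            "--scaffold", scaffold,
        ],
        check=True,  # stop if an evaluation run fails
    )
```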
Contributions are welcome! Please feel free to submit issues or pull requests.
SWE-EVO builds on the original SWE-bench benchmark. We are grateful to the SWE-bench team for their foundational work in software engineering evaluation.
Special thanks to:
- SWE-bench for pioneering software engineering benchmarks for AI
- OpenHands for their open-source AI agent framework
- SWE-agent for their agent scaffold and tooling
- The open-source community behind Django, NumPy, and other projects used in this benchmark
MIT License - See LICENSE for details.
```bibtex
@article{sweevo2024,
  title={SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios},
  author={...},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2024}
}
```
