SWE-EVO

Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios

Evaluate AI agents on realistic software evolution • Multi-step planning and adaptation • Long-horizon reasoning challenges


Introduction • Quick Start • How It Works • Evaluation • Acknowledgements


Introduction

SWE-EVO is a benchmark designed to evaluate AI coding agents in autonomous software evolution tasks. Unlike benchmarks that focus on isolated coding problems, SWE-EVO simulates realistic scenarios where agents must iteratively evolve complex codebases according to high-level software requirement specifications (SRS).

Using versioned histories from real Python open-source projects (such as Django and NumPy), SWE-EVO challenges agents to:

  • Interpret high-level software requirement specifications
  • Plan and implement multi-step changes
  • Navigate large-scale repositories with thousands of files
  • Produce correct changes across multiple versions

The Research Question

Given an existing codebase and evolving requirements, can AI agents autonomously perform sustained planning, adaptation, and evolution over long interactions?


Key Features

  • Realistic Tasks: Derived from authentic project evolution histories, emphasizing change over time
  • Multi-Step Evaluation: Agents must plan, update, and validate changes across versions
  • Modular Scaffolds: Supports evaluation via OpenHands and SWE-agent
  • Public Dataset: Curated instances with tools for reproducible evaluation
  • Long-Horizon Focus: Challenges AI systems with iterative evolution and sustained reasoning

Quick Start

1. Clone the Repository

git clone https://github.com/FSoft-AI4Code/SWE-EVO.git
cd SWE-EVO

2. Install Dependencies

pip install -e .

3. Run Evaluation

python SWE-bench/evaluate_instance.py \
  --trajectories_path <path-to-your-trajectories> \
  --max_workers <num_workers> \
  --scaffold <scaffold_name>

How It Works

Software Evolution Model

Conceptual model of software evolution in SWE-EVO: from base system to evolved system through requirement interpretation and change execution.

Evolution Process

┌──────────────────┐
│   Base Codebase  │  Initial state of the repository
└────────┬─────────┘
         │
         ↓
┌──────────────────┐
│   SRS Document   │  High-level requirements specification
└────────┬─────────┘
         │
         ↓
┌──────────────────┐
│   AI Agent       │  Plans and implements changes
└────────┬─────────┘
         │
         ↓
┌──────────────────┐
│ Evolved Codebase │  Updated repository matching requirements
└──────────────────┘
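
Concretely, each benchmark instance pairs a starting repository state with an SRS and the tests that define the target evolved state. The Python sketch below is illustrative only; the field names (repo, base_commit, srs, target_tests) are assumptions and may not match the actual dataset schema.

from dataclasses import dataclass, field

@dataclass
class EvolutionInstance:
    """Illustrative shape of one SWE-EVO task; field names are hypothetical."""
    repo: str                 # e.g. "django/django"
    base_commit: str          # initial state of the repository
    srs: str                  # high-level software requirement specification
    target_tests: list[str] = field(default_factory=list)  # tests the evolved codebase must pass

# A toy instance, purely for illustration:
example = EvolutionInstance(
    repo="django/django",
    base_commit="<base-commit-sha>",
    srs="High-level requirements for the next version of the project.",
    target_tests=["tests/test_feature.py::test_new_behavior"],
)
print(f"{example.repo} @ {example.base_commit}: {len(example.target_tests)} target test(s)")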

Evaluation

Using OpenHands Scaffold

1. Configure Your OpenHands Agent

cd OpenHands

Edit config.toml (inside the OpenHands directory) and add a new model block. You can leave api_key = "" and pass the real key through an environment variable (for example: export OPENAI_API_KEY=...).

Example:

[llm.your_model]
model = "your_model"
api_key = ""        # leave blank and export API_KEY
base_url = "your_url"
temperature = 0.0
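
Optionally, you can sanity-check the dataset instances before launching inference. The snippet below is a minimal sketch that assumes hf_out/hf_jsonl contains JSON Lines files; the actual layout and field names may differ.

import glob
import json

# Print the top-level keys of the first record in each JSONL file
# (assumed layout; adjust the glob pattern to your setup).
for path in glob.glob("hf_out/hf_jsonl/*.jsonl"):
    with open(path) as fh:
        first = json.loads(fh.readline())
    print(path, "->", sorted(first.keys()))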

2. Generate Trajectories

Use the OpenHands run_infer.sh script:

./evaluation/benchmarks/swe_bench/scripts/run_infer.sh \
  [model_config] \
  [git_version] \
  [agent] \
  [eval_limit] \
  [num_workers] \
  [dataset_path] \
  [dataset_split] \
  [n_runs] \
  [mode]

Example:

./evaluation/benchmarks/swe_bench/scripts/run_infer.sh \
  llm.your_model \
  HEAD \
  CodeActAgent \
  48 \
  3 \
  your_project_path/SWE-EVO/hf_out/hf_jsonl \
  test \
  1 \
  swe

Notes:

  • model_config refers to the config block name you added (for example, llm.your_model)
  • For more information, see the OpenHands SWE-Bench instructions

3. Evaluate Your Results

After inference finishes, evaluate the generated trajectories:

python SWE-bench/evaluate_instance.py \
  --trajectories_path /path/to/openhands/outputs \
  --max_workers 8 \
  --scaffold OpenHands

Using SWE-agent Scaffold

1. Generate SWE-agent Trajectories

cd SWE-agent

sweagent run-batch \
  --config config/default.yaml \
  --agent.model.name [YOUR_MODEL] \
  --agent.model.api_key [YOUR_API_KEY] \
  --agent.model.api_base [YOUR_API_BASE] \
  --agent.model.reasoning_effort "[low|medium|high]" \
  --instances.type swe_bench \
  --instances.path_override "your_project_path/SWE-EVO/hf_out/hf_dataset" \
  --instances.split [dataset_split] \
  --instances.slice :1000 \
  --num_workers [num_workers] \
  --output_dir [output_dir]

Example:

MODEL="gpt-5-2025-08-07"

sweagent run-batch \
  --config config/default.yaml \
  --agent.model.name "$MODEL" \
  --agent.model.api_key "$OPENAI_API_KEY" \
  --agent.model.api_base "https://api.openai.com/v1" \
  --agent.model.reasoning_effort "medium" \
  --instances.type swe_bench \
  --instances.path_override "your_project_path/SWE-EVO/hf_out/hf_dataset" \
  --instances.split "test" \
  --instances.slice ":1000" \
  --num_workers 4 \
  --output_dir "trajectories/$MODEL"

Notes: Please refer to the SWE-agent documentation for additional configuration details and advanced usage.


2. Evaluate the Results

After inference finishes, evaluate the generated trajectories:

python SWE-bench/evaluate_instance.py \
  --trajectories_path /path/to/sweagent/outputs \
  --max_workers 8 \
  --scaffold SWE-agent

Parameters

  • --trajectories_path: Path to your agent trajectory outputs
  • --max_workers: Number of parallel workers for evaluation
  • --scaffold: Scaffold name (OpenHands or SWE-agent)
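
If you prefer to drive the evaluation from a Python script (for example, as part of a larger pipeline), a minimal sketch using the flags above might look like this; the paths and worker count are placeholders.

import subprocess

# Invoke the evaluation script with the documented flags.
# Replace the trajectory path and scaffold name with your own values.
cmd = [
    "python", "SWE-bench/evaluate_instance.py",
    "--trajectories_path", "/path/to/your/outputs",
    "--max_workers", "8",
    "--scaffold", "OpenHands",  # or "SWE-agent"
]
subprocess.run(cmd, check=True)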

Requirements

At minimum, you need Python and pip: project dependencies are installed with pip install -e . (see Quick Start). Running inference additionally requires the OpenHands or SWE-agent scaffold, configured as described in the Evaluation section.

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.


Acknowledgements

SWE-EVO builds on the original SWE-bench benchmark. We are grateful to the SWE-bench team for their foundational work in software engineering evaluation.

Special thanks to:

  • SWE-bench for pioneering software engineering benchmarks for AI
  • OpenHands for their open-source AI agent framework
  • SWE-agent for their agent scaffold and tooling
  • The open-source community behind Django, NumPy, and other projects used in this benchmark

License

MIT License - See LICENSE for details.


Citation

@article{sweevo2024,
  title={SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios},
  author={...},
  journal={arXiv preprint arXiv:XXXX.XXXXX},
  year={2024}
}
