Two-stage pipeline for running editing agents on SWE environments:
Stage 1: build/transfer images -> Stage 2: run agents
conda create -n inference python=3.13 -y
conda activate inference
pip install -r requirements-inference.txt

After data collection, we typically have Dockerfiles for each instance. To run experiments, we build these images locally first. This stage also normalizes the environment so downstream evaluation is stable and less noisy. Common issues we handle here include (see the spot-check sketch after the list):
- Reward/test execution taking too long (e.g. >300s).
- Repositories not mounted at `/testbed` (e.g., code lives under `/testbed/mypy`).
- Extra environment files under `/testbed` causing noisy diffs/patches (e.g., `/testbed/.venv`).
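If you want to spot-check a built image for these issues yourself, a rough sketch using the Docker CLI from Python might look like the following; the image name is a placeholder, and the exact checks depend on the repository:

```python
# Rough spot-check for the issues above. "your-image:latest" is a placeholder
# for one of your built instance images; adjust the checks to your repository.
import subprocess

IMAGE = "your-image:latest"

def exit_code(cmd: str) -> int:
    """Run a shell command inside the image and return its exit code."""
    return subprocess.run(
        ["docker", "run", "--rm", IMAGE, "bash", "-lc", cmd],
        capture_output=True,
    ).returncode

# The repository should live directly under /testbed, not in a subdirectory.
print("repo at /testbed:", exit_code("test -e /testbed/.git") == 0)
# Extra environment files under /testbed tend to pollute diffs/patches.
print("stray /testbed/.venv:", exit_code("test -d /testbed/.venv") == 0)
```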
Use the transfer agent runner (LLM-driven) to normalize the environment and build images,
then embed artifacts (Dockerfile + eval script) into a new dataset file. Use the
dataset produced by the SWE-Builder stage (the output of app/main.py under the
results directory) as-is.
Important
Stage 1 uses direct OpenAI-compatible chat completions (not LiteLLM).
export OPENAI_API_KEY="YOUR_API_KEY"
export OPENAI_BASE_URL="YOUR_URL"  # optional override

Set the model name via `--model_name <model_name>` (required).
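If you want to confirm the endpoint is reachable before starting a long build run, a minimal connectivity check with the `openai` Python client could look like this; it is optional and not part of the pipeline, and the model name is a placeholder for whatever you pass to `--model_name`:

```python
# Optional connectivity check (not part of the pipeline): confirm the
# OpenAI-compatible endpoint and key work before launching a long build run.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.environ.get("OPENAI_BASE_URL"),  # None falls back to the default endpoint
)
resp = client.chat.completions.create(
    model="<model_name>",  # the same value you pass to --model_name
    messages=[{"role": "user", "content": "ping"}],
    max_tokens=8,
)
print(resp.choices[0].message.content)
```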
python inference/build_image/main.py \
--input /path/to/instances.json \
--output /path/to/run_dir \
--max-iterations 5 \
--eval-timeout 300 \
--max-workers 2 \
--model_name <model_name>

`--eval-timeout` defaults to 300 seconds if omitted. Instances that exceed the timeout are marked as failures and filtered out of the transferred dataset.
| Parameter | Meaning | Allowed / Example |
|---|---|---|
| `--input` | Path to raw instances JSON list | `/path/to/instances.json` |
| `--output` | Output directory for build artifacts | `/path/to/run_dir` |
| `--max-iterations` | LLM edit/build iterations per instance | `5` |
| `--eval-timeout` | Eval script timeout (seconds) | `300` |
| `--max-workers` | Parallel workers | `2` |
| `--skip-existing` | Skip instances with existing `summary.json` | flag |
| `--model_name` | LLM model name | `<model_name>` |
Outputs in `--output`:

- `summary.json` / `summary_main.json`
- `<input_stem>_transferred.json` (successful entries with `docker_image`, `dockerfile`, `eval_script`). The `docker_image` field points to the built SWE environment image for each instance (a quick sanity-check sketch follows the list).
- `<input_stem>_failed.json`
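To sanity-check a finished build run, you can load the transferred and failed files and confirm the fields Stage 2 expects are present. A minimal sketch, with placeholder paths and `<input_stem>` taken from whatever you passed to `--input`:

```python
# Quick look at a finished build run. Paths are placeholders; <input_stem> is
# the stem of the file you passed to --input.
import json
from pathlib import Path

run_dir = Path("/path/to/run_dir")   # the --output directory
stem = "instances"                   # e.g. for --input /path/to/instances.json

transferred = json.loads((run_dir / f"{stem}_transferred.json").read_text())
failed = json.loads((run_dir / f"{stem}_failed.json").read_text())
print(f"transferred: {len(transferred)}, failed: {len(failed)}")

# Stage 2 expects these fields on every transferred entry.
required = {"instance_id", "docker_image", "dockerfile", "eval_script"}
for entry in transferred:
    missing = required - entry.keys()
    if missing:
        print(entry.get("instance_id", "<unknown>"), "is missing", sorted(missing))
```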
Run against the transferred dataset produced in Stage 1.
| Agent | Scaffold | Tools | Notes | Recommended Use |
|---|---|---|---|---|
| mini_swe_agent | `mini_swe_agent` | bash-only | non-fn only | multi-language / non-Python repos |
| live_swe_agent | `live_swe_agent` | bash-only | non-fn only | multi-language / non-Python repos |
| DeepSWE (r2egym) | `r2egym` | Python tools | fn + non-fn supported | Python repos |
| OpenHands (experimental, unofficial) | `openhands` | Python tools | fn + non-fn supported | Python repos |
OpenHands diverges significantly from the original implementation, so please use it with caution.
The dataset should be the Stage 1 transferred output (for example,
`/path/to/run_dir/<input_stem>_transferred.json`). Each entry should include:
`instance_id`, `docker_image`, `dockerfile`, and `eval_script`.
Model calls go through LiteLLM. Set your base URL and provider API key before running (examples below).
export LLM_BASE_URL="YOUR_URL"
export OPENAI_API_KEY="YOUR_API_KEY"
export OPENROUTER_API_KEY="YOUR_OPENROUTER_KEY"

Note
Run these commands from the repo root. If you run from another directory,
set PYTHONPATH to the repo path first.
mini_swe_agent:
python -m inference.agenthub.run.edit runagent_multiple \
--dataset /path/to/TRANSFERRED_DATASET.json \
--split dev \
--k 1 \
--start_idx 0 \
--max_workers 5 \
--traj_dir ./run_logs/mini_swe_run \
--exp_name mini_swe_run \
--llm_name openai/gpt-4o-mini \
--use_fn_calling False \
--backend docker \
--scaffold mini_swe_agent

DeepSWE editing agent (r2egym scaffold):
python -m inference.agenthub.run.edit runagent_multiple \
--dataset /path/to/TRANSFERRED_DATASET.json \
--split dev \
--k 1 \
--start_idx 0 \
--max_workers 5 \
--traj_dir ./run_logs/deepswe_run \
--exp_name deepswe_run \
--llm_name openai/gpt-4o-mini \
--use_fn_calling True \
--backend docker \
--scaffold r2egym

live_swe_agent:
python -m inference.agenthub.run.edit runagent_multiple \
--dataset /path/to/TRANSFERRED_DATASET.json \
--split dev \
--k 1 \
--start_idx 0 \
--max_workers 5 \
--traj_dir ./run_logs/live_swe_run \
--exp_name live_swe_run \
--llm_name openai/gpt-4o-mini \
--use_fn_calling False \
--backend docker \
--scaffold live_swe_agent

OpenHands (experimental, unofficial):
python -m inference.agenthub.run.edit runagent_multiple \
--dataset /path/to/TRANSFERRED_DATASET.json \
--split dev \
--k 1 \
--start_idx 0 \
--max_workers 5 \
--traj_dir ./run_logs/openhands_run \
--exp_name openhands_run \
--llm_name openai/gpt-4o-mini \
--use_fn_calling True \
--backend docker \
--scaffold openhands

Notes:
- `--split` is only a label for the local JSON loader; keep it consistent (e.g., `dev`).
- If you already have local images, you can skip Stage 1 and provide a dataset that includes `instance_id`, `docker_image`, `dockerfile`, and `eval_script`.
- For local swefactory inference, ensure your `docker_image` name contains `swefactory` so the runtime selects swefactory mode (see the quick check below).
- For `r2egym` and `openhands`, you can use either `--use_fn_calling True` or `False`. Use `True` only if your model/provider returns tool calls; there is no auto fallback.
- `--backend` currently supports `docker` only.
- If you use a non-OpenAI provider, set the matching API key env var (for example, `ANTHROPIC_API_KEY`).
- This codebase is built on top of R2E-Gym; thanks to the original authors.
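As a quick pre-flight check for the `swefactory` naming requirement above, you can scan the dataset for offending entries; a small sketch with a placeholder path:

```python
# Quick pre-flight check for the swefactory naming requirement above.
# The dataset path is a placeholder for your Stage 1 transferred dataset.
import json

entries = json.load(open("/path/to/TRANSFERRED_DATASET.json"))
bad = [e["instance_id"] for e in entries if "swefactory" not in e.get("docker_image", "")]
print("entries whose docker_image lacks 'swefactory':", bad or "none")
```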
Each run writes artifacts to `--traj_dir`. For example:
./run_logs/my_run
Directory layout:
run_logs/<exp_name>/
  <exp_name>.jsonl
  trajectories.jsonl
  trajectories_rejection_sampling.jsonl
  reward_summary.json
  <instance_id>/
    agent.log
    output_patch.diff
    test_output.log
    metadata.json
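For example, to see which instances produced a non-empty patch, you could glob the per-instance directories; a sketch, assuming `./run_logs/my_run` is your `--traj_dir`:

```python
# Which instances produced a non-empty patch? Adjust the path to your --traj_dir.
from pathlib import Path

run_dir = Path("./run_logs/my_run")

for inst_dir in sorted(p for p in run_dir.iterdir() if p.is_dir()):
    patch = inst_dir / "output_patch.diff"
    ok = patch.exists() and patch.stat().st_size > 0
    print(f"{inst_dir.name}: {'patch' if ok else 'no patch'}")
```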
History-only files: `trajectories.jsonl` and `trajectories_rejection_sampling.jsonl`

- non-fn-calling: each line is a raw `messages` list
- fn-calling: each line is `{"messages": [...], "tools": [...]}` (tools schema matches the model call; see the loader sketch below)
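A small sketch for loading either format back into a uniform view; the file path is a placeholder for one of the history files above:

```python
# Load either trajectory format back into a uniform (messages, tools) view.
import json
from pathlib import Path

def load_histories(path):
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        if isinstance(record, dict):   # fn-calling: {"messages": [...], "tools": [...]}
            yield record["messages"], record.get("tools")
        else:                          # non-fn-calling: a raw messages list
            yield record, None

for messages, tools in load_histories("./run_logs/my_run/trajectories.jsonl"):
    print(len(messages), "messages | tools attached:", tools is not None)
```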
| Parameter | Meaning | Allowed / Example |
|---|---|---|
| `--dataset` | Path to Stage 1 transferred dataset | `/path/to/TRANSFERRED_DATASET.json` |
| `--split` | Label for local JSON loader | `dev` |
| `--k` | Number of instances to run | `1` |
| `--start_idx` | Start index in dataset | `0` |
| `--max_workers` | Parallel workers | `5` |
| `--traj_dir` | Output directory for logs/artifacts | `./run_logs/my_run` |
| `--exp_name` | Experiment name | `my_run` |
| `--llm_name` | LiteLLM model name | `openai/gpt-4o-mini` |
| `--use_fn_calling` | Function-calling mode | `True` or `False` (depends on scaffold + model support) |
| `--backend` | Runtime backend | `docker` |
| `--scaffold` | Agent scaffold | `mini_swe_agent` / `r2egym` / `live_swe_agent` / `openhands` |
Besides the built-in agenthub scaffolds above, you can also evaluate
Terminal-Bench style agents with Harbor on SWE-Factory environments.
pip install harbor

Harbor does not read the SWE-Factory dataset directly. First convert your transferred dataset into Harbor task format, then run Harbor on the converted task directory.
python inference/build_image/convert_to_harbor_format.py \
--input /path/to/TRANSFERRED_DATASET.json \
--out-dir ./harbor

This generates a Harbor-compatible task directory like:
./harbor/
  <task_id>/
    instruction.md
    task.toml
    environment/Dockerfile
    tests/test.sh
    tests/eval.sh
    solution/solve.sh
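To confirm the conversion produced complete tasks, a quick check over the generated directories might look like this; a sketch against the layout above, assuming the default `./harbor` output directory:

```python
# Check that every converted task directory contains the files listed above.
from pathlib import Path

EXPECTED = [
    "instruction.md",
    "task.toml",
    "environment/Dockerfile",
    "tests/test.sh",
    "tests/eval.sh",
    "solution/solve.sh",
]

for task_dir in sorted(p for p in Path("./harbor").iterdir() if p.is_dir()):
    missing = [f for f in EXPECTED if not (task_dir / f).exists()]
    if missing:
        print(f"{task_dir.name}: missing {missing}")
```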
export OPENROUTER_API_KEY="YOUR_OPENROUTER_KEY"
harbor run \
-p "$(pwd)/harbor" \
-a terminus-2 \
-m "openrouter/stepfun/step-3.5-flash" \
-n 4 \
-l 10

Notes:
- Harbor support is a separate evaluation path from `agenthub`; it is mainly for running Terminal-Bench compatible agents.
- Conversion is required before running Harbor.
- Harbor consumes the same Stage 1 transferred dataset format used by Stage 2.
The inference/agenthub module is developed on top of R2E-Gym. Thanks to the original authors for their work.