This repository archives SWE-AGI evaluation runs (workspace snapshots + logs + metrics) for multiple models so results can be compared and reviewed later.
This repo is not the benchmark itself. For the benchmark definitions and the task workspaces, see the main SWE-AGI repo.
Each model has a top-level directory, and each task has a subdirectory:
```
SWE-AGI-Eval/
└── <model>/
    └── <task>/
        ├── (snapshot of the task workspace)
        ├── log.yaml
        ├── log.jsonl
        └── run-metrics.json
```
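Because every run follows this layout, archived runs can be enumerated with a simple glob. A minimal sketch, assuming it is run from the repository root:

```shell
# List every archived run by globbing for its metrics file.
# Relies only on the <model>/<task> layout shown above.
for metrics in */*/run-metrics.json; do
  run_dir=$(dirname "$metrics")
  echo "$run_dir"
done
```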
Typical files:
- `run-metrics.json`: start/end time, elapsed time, exit code, and a test summary (when available)
- `log.yaml`: human-readable event log produced by the agent front-end
- `log.jsonl`: raw event stream (useful for tooling)
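As a sketch of how tooling can consume these artifacts, the snippet below loads a `run-metrics.json` and streams events from a `log.jsonl`. The field names used in the usage example (e.g. `exit_code`) are illustrative assumptions, not the actual schema:

```python
import json
from pathlib import Path


def load_metrics(run_dir: Path) -> dict:
    """Parse run-metrics.json, which is a single JSON object."""
    return json.loads((run_dir / "run-metrics.json").read_text())


def iter_events(run_dir: Path):
    """Stream events from log.jsonl: one JSON object per non-empty line."""
    with (run_dir / "log.jsonl").open() as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Usage might look like `load_metrics(Path("modelA/task1"))` followed by iterating `iter_events(...)` to count tool actions.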
All reports live in `report/`:
- `report/README.md`: evaluation summary + per-task detailed results (tables)
- `report/behavior_stats.md`: behavior statistics inferred from tool actions in `log.yaml`
- `report/BEHAVIOR_DEFINITIONS.md`: category definitions used by the behavior stats
Regenerate reports:

```
python3 report/eval_results_report.py
python3 report/behavior_stats.py
```

To add a run:

- Create a top-level directory named after the model (optionally include a config suffix).
- Copy the SWE-AGI task workspace snapshot into `./<model>/<task>/`.
- Save logs + metrics into the same folder (at minimum: `log.yaml` + `run-metrics.json`).
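The steps above can be sketched as follows; the snapshot path and the `run-artifacts/` location of the logs are hypothetical placeholders, not paths this repo defines:

```shell
# Hypothetical example: archive one finished run under <model>/<task>.
MODEL="my-model"              # placeholder model directory name
TASK="my-task"                # placeholder task name
SNAPSHOT="/path/to/snapshot"  # placeholder: finished SWE-AGI task workspace

mkdir -p "$MODEL/$TASK"
cp -r "$SNAPSHOT/." "$MODEL/$TASK/"                # workspace snapshot
cp run-artifacts/log.yaml "$MODEL/$TASK/"          # placeholder log location
cp run-artifacts/run-metrics.json "$MODEL/$TASK/"
```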
Inspect a run summary:

```
jq . <model>/<task>/run-metrics.json
```

Re-run tests for a snapshot locally (requires MoonBit tooling):

```
cd <model>/<task>
moon test
```

Apache-2.0. See LICENSE.
Copyright (c) 2026 MoonBit Team.