This repository archives SWE-AGI evaluation runs (workspace snapshots + logs + metrics) for multiple models so results can be compared and reviewed later.
This repo is not the benchmark itself. For the benchmark definitions and the task workspaces, see the main SWE-AGI repo.
Each model has a top-level directory, and each task has a subdirectory:
```
SWE-AGI-Eval/
└── <model>/
    └── <task>/
        ├── (snapshot of the task workspace)
        ├── log.yaml
        ├── log.jsonl
        └── run-metrics.json
```
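Because every run follows this layout, archived runs can be enumerated with a simple glob. A minimal sketch, assuming it is run from the repository root:

```shell
# List every archived run by globbing for its metrics file.
# Relies only on the <model>/<task> layout shown above.
for metrics in */*/run-metrics.json; do
  run_dir=$(dirname "$metrics")
  echo "$run_dir"
done
```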
Typical files:
- `run-metrics.json`: start/end time, elapsed time, exit code, and a test summary (when available)
- `log.yaml`: human-readable event log produced by the agent front-end
- `log.jsonl`: raw event stream (useful for tooling)
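As a sketch of how tooling can consume these artifacts, the snippet below loads a `run-metrics.json` and streams events from a `log.jsonl`. The field names used in the usage example (e.g. `exit_code`) are illustrative assumptions, not the actual schema:

```python
import json
from pathlib import Path


def load_metrics(run_dir: Path) -> dict:
    """Parse run-metrics.json, which is a single JSON object."""
    return json.loads((run_dir / "run-metrics.json").read_text())


def iter_events(run_dir: Path):
    """Stream events from log.jsonl: one JSON object per non-empty line."""
    with (run_dir / "log.jsonl").open() as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Usage might look like `load_metrics(Path("modelA/task1"))` followed by iterating `iter_events(...)` to count tool actions.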
All reports live in `report/`:
- `report/README.md`: evaluation summary + per-task detailed results (tables)
- `report/behavior_stats.md`: behavior statistics inferred from tool actions in `log.yaml`
- `report/BEHAVIOR_DEFINITIONS.md`: category definitions used by the behavior stats
Regenerate reports:

```
python3 report/eval_results_report.py
python3 report/behavior_stats.py
```

To add a run:

- Create a top-level directory named after the model (optionally include a config suffix).
- Copy the SWE-AGI task workspace snapshot into `./<model>/<task>/`.
- Save logs + metrics into the same folder (at minimum: `log.yaml` + `run-metrics.json`).
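The steps above can be sketched as follows; the snapshot path and the `run-artifacts/` location of the logs are hypothetical placeholders, not paths this repo defines:

```shell
# Hypothetical example: archive one finished run under <model>/<task>.
MODEL="my-model"              # placeholder model directory name
TASK="my-task"                # placeholder task name
SNAPSHOT="/path/to/snapshot"  # placeholder: finished SWE-AGI task workspace

mkdir -p "$MODEL/$TASK"
cp -r "$SNAPSHOT/." "$MODEL/$TASK/"                # workspace snapshot
cp run-artifacts/log.yaml "$MODEL/$TASK/"          # placeholder log location
cp run-artifacts/run-metrics.json "$MODEL/$TASK/"
```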
Inspect a run summary:

```
jq . <model>/<task>/run-metrics.json
```

Re-run tests for a snapshot locally (requires MoonBit tooling):

```
cd <model>/<task>
moon test
```

Apache-2.0. See LICENSE.
Copyright (c) 2026 MoonBit Team.