moonbitlang/SWE-AGI-Eval

SWE-AGI-Eval: Run Archive & Reports

This repository archives SWE-AGI evaluation runs (workspace snapshots + logs + metrics) for multiple models so results can be compared and reviewed later.

This repo is not the benchmark itself. For the benchmark definitions and task workspace, see the main SWE-AGI repo.

What’s in this repo

1) Archived runs

Each model has a top-level directory, and each task has a subdirectory:

SWE-AGI-Eval/
└── <model>/
    └── <task>/
        ├── (snapshot of the task workspace)
        ├── log.yaml
        ├── log.jsonl
        └── run-metrics.json

Typical files:

  • run-metrics.json: start/end time, elapsed time, exit code, and test summary (when available)
  • log.yaml: human-readable event log produced by the agent front-end
  • log.jsonl: raw event stream (useful for tooling)
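Because log.jsonl is plain JSON Lines, a few lines of Python are enough to tally a run's events for ad-hoc tooling. A minimal sketch (the "event" field name is an assumption; adjust it to the actual log schema):

```python
import json
from collections import Counter
from pathlib import Path

def count_events(log_path):
    """Tally records in a log.jsonl file by their (assumed) "event" field."""
    counts = Counter()
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines defensively
            record = json.loads(line)
            counts[record.get("event", "unknown")] += 1
    return counts

# Tiny demo against a synthetic log (not a real run archive):
sample = Path("log.jsonl")
sample.write_text('{"event": "tool_call"}\n{"event": "tool_call"}\n{"event": "message"}\n')
print(count_events(sample))  # Counter({'tool_call': 2, 'message': 1})
```

The same loop generalizes to whatever per-event analysis a report needs, since each line is an independent JSON object.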

2) Generated reports

All reports live in report/:

  • report/README.md: evaluation summary + per-task detailed results (tables)
  • report/behavior_stats.md: behavior statistics inferred from tool actions in log.yaml
  • report/BEHAVIOR_DEFINITIONS.md: category definitions used by behavior stats

Regenerate reports:

python3 report/eval_results_report.py
python3 report/behavior_stats.py

Adding a new archive (convention)

  1. Create a top-level directory named after the model (optionally including a config suffix).
  2. Copy the SWE-AGI task workspace snapshot into ./<model>/<task>/.
  3. Save logs + metrics into the same folder (at minimum: log.yaml + run-metrics.json).
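The steps above can be scripted with the standard library. A hedged sketch using `shutil.copytree` (all paths and the `archive_run` helper are illustrative, not part of this repo's tooling):

```python
import shutil
import tempfile
from pathlib import Path

def archive_run(workspace, repo_root, model, task):
    """Copy a task workspace snapshot into <repo_root>/<model>/<task>/.

    Logs and metrics (log.yaml, run-metrics.json) are expected to already
    be inside the workspace, or can be copied into the same folder after.
    """
    dest = Path(repo_root) / model / task
    shutil.copytree(workspace, dest, dirs_exist_ok=True)
    return dest

# Demo with temporary directories (placeholder names throughout):
with tempfile.TemporaryDirectory() as tmp:
    ws = Path(tmp) / "workspace"
    ws.mkdir()
    (ws / "log.yaml").write_text("events: []\n")
    dest = archive_run(ws, Path(tmp) / "SWE-AGI-Eval", "my-model", "task-01")
    archived_ok = (dest / "log.yaml").exists()
print(archived_ok)  # True
```

Note `dirs_exist_ok=True` (Python 3.8+) lets you re-run the copy when topping up an existing archive directory.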

Common commands

Inspect a run summary:

jq . <model>/<task>/run-metrics.json

Re-run tests for a snapshot locally (requires the MoonBit toolchain):

cd <model>/<task>
moon test

License

Apache-2.0. See LICENSE.

Copyright

Copyright (c) 2026 MoonBit Team.