0.3.0
What's changed
New eval infrastructure
- Self-contained session eval harness (`eval/session_eval.ts`): loads session transcripts from JSON files, distills on the fly, seeds the DB so the `recall` tool works during evaluation, and compares against OpenCode's actual compaction behavior (summary of early messages + 80K tail window)
- 20 questions across two real coding sessions (113K and 353K tokens)
- Token tracking with cost-per-correct-answer metrics
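
The compaction baseline described above (summary of early messages plus an 80K-token tail) can be pictured with a minimal sketch. The `Message` shape, the `splitAtTail` helper, and the per-message token counts are hypothetical and only illustrate the idea; this is not OpenCode's or the harness's actual code.

```typescript
// Hypothetical message shape; real transcripts will carry more fields.
interface Message {
  id: string;
  tokens: number;
}

// Split a transcript into an early part (to be summarized) and a tail that
// fits within the token budget, keeping the most recent messages verbatim.
function splitAtTail(
  messages: Message[],
  tailBudget: number,
): { early: Message[]; tail: Message[] } {
  let used = 0;
  let start = messages.length;
  // Walk backwards from the newest message until the budget would be exceeded.
  for (let i = messages.length - 1; i >= 0; i--) {
    const m = messages[i];
    if (used + m.tokens > tailBudget) break;
    used += m.tokens;
    start = i;
  }
  return { early: messages.slice(0, start), tail: messages.slice(start) };
}

// Toy transcript: only the last two messages fit in an 80K tail.
const transcript: Message[] = [
  { id: "m1", tokens: 40000 },
  { id: "m2", tokens: 50000 },
  { id: "m3", tokens: 30000 },
];
const { early, tail } = splitAtTail(transcript, 80000);
console.log(early.map((m) => m.id), tail.map((m) => m.id)); // early: m1; tail: m2, m3
```

Under this assumption, anything in `early` survives only through the summary, which is where questions about early and mid-session details can go wrong.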
Results (Claude Sonnet 4)
| Mode | Score | Cost |
|---|---|---|
| Default (compaction + 80K tail) | 10/20 (50%) | $8.14 |
| Lore (distillation + recall) | 17/20 (85%) | $1.87 |
Lore's 35-percentage-point accuracy advantage comes entirely from early- and mid-session details that fall outside the tail window; on late-session details the two modes are tied. Cost per correct answer: $0.11 (Lore) vs $0.81 (Default), about 7.4x cheaper.
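
For readers who want to check the arithmetic, the figures can be reproduced from the results table. This is a small sketch; the `EvalResult` type and helper names are made up for illustration and are not part of the harness.

```typescript
// Scores and total costs copied from the results table.
interface EvalResult {
  mode: string;
  correct: number;
  total: number;
  costUsd: number;
}

const baseline: EvalResult = { mode: "Default", correct: 10, total: 20, costUsd: 8.14 };
const lore: EvalResult = { mode: "Lore", correct: 17, total: 20, costUsd: 1.87 };

// Cost per correct answer: total spend divided by the number of correct answers.
function costPerCorrect(r: EvalResult): number {
  return r.costUsd / r.correct;
}

// Accuracy as a percentage (multiply first to keep the division exact here).
function accuracyPct(r: EvalResult): number {
  return (r.correct * 100) / r.total;
}

const gapPp = accuracyPct(lore) - accuracyPct(baseline);
const ratio = costPerCorrect(baseline) / costPerCorrect(lore);

console.log(gapPp);                               // 35
console.log(costPerCorrect(lore).toFixed(2));     // 0.11
console.log(costPerCorrect(baseline).toFixed(2)); // 0.81
console.log(ratio.toFixed(1));                    // 7.4
```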
Bug fixes
- `agents-file`: sort category headings alphabetically in AGENTS.md export
- `ltm` test: fix monotonic ID test failing on fast CI (`Date.now()` collision)
- `src/index.ts`: session error handler now skips eval/child sessions
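
The `Date.now()` collision arises because two calls within the same millisecond return the same value, so IDs derived directly from the clock can tie on a fast machine. One common way to guarantee strictly increasing IDs is sketched below; `createMonotonicId` is a hypothetical helper for illustration, not Lore's implementation or the actual fix.

```typescript
// Hypothetical helper, not Lore's code: returns strictly increasing numeric IDs
// even when the clock reports the same millisecond more than once.
function createMonotonicId(now: () => number = Date.now): () => number {
  let last = 0;
  return () => {
    const t = now();
    // If the clock has not advanced (or went backwards), step past the last ID.
    last = t > last ? t : last + 1;
    return last;
  };
}

// Simulate a fast CI machine where the clock is stuck on one millisecond.
const nextId = createMonotonicId(() => 1000);
const ids = [nextId(), nextId(), nextId()];
console.log(ids); // [1000, 1001, 1002]
```

Injecting the clock as a parameter also makes the collision case reproducible in a test without depending on machine speed.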
Eval harness improvements
- All eval files import `DISTILLATION_SYSTEM` from `src/prompt` (DRY)
- `backfill.ts`: `--wipe` flag to clear old distillations before re-distilling
- `coding_eval.ts`: token tracking, stronger eval session isolation
- Removed LongMemEval harness and old result files