0.3.0

@BYK released this 26 Feb 15:51 (commit 545bb85)

What's changed

New eval infrastructure

  • Self-contained session eval harness (eval/session_eval.ts): loads session transcripts from JSON files, distills on the fly, seeds the DB so the recall tool works during evaluation, and compares against OpenCode's actual compaction behavior (summary of early messages + 80K tail window)
  • 20 questions across two real coding sessions (113K and 353K tokens)
  • Token tracking with cost-per-correct-answer metrics
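
The baseline described above (a summary of early messages plus an 80K-token tail) can be pictured with a rough sketch. This is not OpenCode's or the harness's actual code: the `Message` shape, the per-message `tokens` field, and the `splitForCompaction` helper are all hypothetical, and the rule of keeping whole messages from the end until the budget is exhausted is an assumption.

```typescript
// Hypothetical message shape; a real transcript carries more fields.
interface Message {
  role: "user" | "assistant";
  text: string;
  tokens: number; // assumed to be precomputed by some tokenizer
}

interface CompactionSplit {
  early: Message[]; // would be replaced by a summary
  tail: Message[]; // kept verbatim
  tailTokens: number;
}

// Walk backwards from the newest message, keeping whole messages
// while they still fit inside the tail budget (80K by default).
function splitForCompaction(
  messages: Message[],
  tailBudget: number = 80_000,
): CompactionSplit {
  let used = 0;
  let start = messages.length;
  for (let i = messages.length - 1; i >= 0; i--) {
    if (used + messages[i].tokens > tailBudget) break;
    used += messages[i].tokens;
    start = i;
  }
  return {
    early: messages.slice(0, start),
    tail: messages.slice(start),
    tailTokens: used,
  };
}
```

Anything that lands in `early` survives only through the summary, which is why questions about early and mid-session details are the ones that separate the two modes.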

Results (Claude Sonnet 4)

Mode                              Score          Cost
Default (compaction + 80K tail)   10/20 (50%)    $8.14
Lore (distillation + recall)      17/20 (85%)    $1.87

Lore's 35-percentage-point accuracy advantage comes entirely from early- and mid-session details that fall outside the tail window; the two modes tie on late-session details. Cost per correct answer: $0.11 for Lore vs $0.81 for the default (7.4x cheaper).
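
The headline figures follow directly from the results above. As a minimal sketch of the arithmetic (the `EvalResult` type and the helper names are illustrative, not the harness's API):

```typescript
// Illustrative result record; field names are assumptions.
interface EvalResult {
  mode: string;
  correct: number;
  total: number;
  costUsd: number;
}

// Fraction of questions answered correctly.
function accuracy(r: EvalResult): number {
  return r.correct / r.total;
}

// Total cost divided by the number of correct answers.
// With no correct answers the metric is undefined; return Infinity.
function costPerCorrect(r: EvalResult): number {
  return r.correct === 0 ? Infinity : r.costUsd / r.correct;
}

// Figures taken from the results reported in these notes.
const baseline: EvalResult = { mode: "Default", correct: 10, total: 20, costUsd: 8.14 };
const lore: EvalResult = { mode: "Lore", correct: 17, total: 20, costUsd: 1.87 };

const gapPp = Math.round((accuracy(lore) - accuracy(baseline)) * 100);
const ratio = costPerCorrect(baseline) / costPerCorrect(lore);

console.log(`accuracy gap: ${gapPp}pp`); // 35pp
console.log(`Lore: $${costPerCorrect(lore).toFixed(2)} per correct answer`); // $0.11
console.log(`Default: $${costPerCorrect(baseline).toFixed(2)} per correct answer`); // $0.81
console.log(`ratio: ${ratio.toFixed(1)}x`); // 7.4x
```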

Bug fixes

  • agents-file: sort category headings alphabetically in AGENTS.md export
  • ltm test: fix monotonic ID test failing on fast CI (Date.now() collision)
  • src/index.ts: session error handler now skips eval/child sessions
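
For context on the Date.now() collision: two calls inside the same millisecond return the same value, so any ID or test that relies on the timestamp alone can see duplicates on a fast machine. The sketch below shows one common way to keep IDs strictly increasing regardless of clock resolution. It is not Lore's implementation; `createMonotonicId` and its injectable clock are hypothetical names used for illustration.

```typescript
// Returns a generator whose IDs are strictly increasing even when
// the clock reports the same millisecond more than once.
// The clock is injectable so tests can freeze time deterministically.
function createMonotonicId(now: () => number = Date.now): () => number {
  let last = 0;
  return () => {
    const t = now();
    // Use the timestamp when it has advanced; otherwise bump by one.
    last = t > last ? t : last + 1;
    return last;
  };
}

// Usage with a frozen clock, simulating calls within one millisecond.
const nextId = createMonotonicId(() => 1000);
console.log(nextId(), nextId(), nextId()); // 1000 1001 1002
```

Injecting the clock also makes a monotonicity test independent of how fast the CI machine is, since the collision case can be forced rather than left to timing.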

Eval harness improvements

  • All eval files import DISTILLATION_SYSTEM from src/prompt (DRY)
  • backfill.ts: --wipe flag to clear old distillations before re-distilling
  • coding_eval.ts: token tracking, stronger eval session isolation
  • Removed LongMemEval harness and old result files