
Commit 6b0e353

eval: remove LongMemEval harness, data references, and old result files
LongMemEval was useful for early development, but the coding session eval is a better measure of real-world value. Removed:

- eval/harness.ts and eval/evaluate.ts (LongMemEval harnesses)
- eval/evaluation/ (Python scoring scripts)
- 22 old result files (baseline_oracle, nuum_oracle, etc.)
- the LongMemEval section from the README benchmarks
- the longmemeval .gitignore pattern
1 parent b826797 commit 6b0e353

28 files changed: +1 -7371 lines

.gitignore

Lines changed: 0 additions & 2 deletions
Lines changed: 0 additions & 2 deletions

```diff
@@ -5,8 +5,6 @@ dist/
 *.db-wal
 *.db-shm
 
-# Large eval benchmark data (download separately)
-eval/data/longmemeval_*.json
 
 # Local plans and agent config
 .plans/
```

README.md

Lines changed: 1 addition & 17 deletions
```diff
@@ -24,21 +24,6 @@ A **gradient context manager** decides how much of each tier to include in each
 
 > Scores below are on Claude Sonnet 4 (claude-sonnet-4-6). Results may vary with other models.
 
-### General memory recall
-
-500-question evaluation using the [LongMemEval](https://github.com/xiaowu0162/LongMemEval) benchmark (ICLR 2025), tested in oracle mode (full message history provided as conversation context).
-
-| Category                   | No plugin | Lore      |
-|----------------------------|-----------|-----------|
-| Single-session (user)      | 71.9%     | 93.8%     |
-| Single-session (prefs)     | 46.7%     | 86.7%     |
-| Single-session (assistant) | 91.1%     | 96.4%     |
-| Multi-session              | 76.9%     | 85.1%     |
-| Knowledge updates          | 84.7%     | 93.1%     |
-| Temporal reasoning         | 64.6%     | 81.9%     |
-| Abstention                 | 53.3%     | 86.7%     |
-| **Overall**                | **72.6%** | **88.0%** |
-
 ### Coding session recall
 
 20 questions across 2 real coding sessions (113K and 353K tokens), targeting specific facts at varying depths. Default mode simulates OpenCode's actual behavior: compaction of early messages + 80K-token tail window. Lore mode uses on-the-fly distillation + the `recall` tool for searching raw message history.
```
```diff
@@ -87,7 +72,7 @@ This plugin was built in a few intense sessions. Some highlights:
 
 **Markdown injection.** Property-based testing with fast-check revealed that user-generated content in facts (code fences, heading markers, thematic breaks) could break the markdown structure of the injected context, confusing the model.
 
-**v2 — observation logs.** Switching to Mastra's observer/reflector architecture with plain-text timestamped observation logs was the breakthrough — LongMemEval jumped from 73.8% to 88.0%. The key insight: dated event logs preserve temporal relationships that structured JSON destroys.
+**v2 — observation logs.** Switching to Mastra's observer/reflector architecture with plain-text timestamped observation logs was the breakthrough. The key insight: dated event logs preserve temporal relationships that structured JSON destroys.
 
 **Prompt refinements.** The push from 80% to 93.3% on the initial coding recall eval came from two observer prompt additions: "EXACT NUMBERS — NEVER APPROXIMATE" (the observer was rounding counts) and "BUG FIXES — ALWAYS RECORD" (early-session fixes were being compressed away during reflection).
 
```
```diff
@@ -157,7 +142,6 @@ The assistant gets a `recall` tool that searches across stored messages and know
 - [How we solved the agent memory problem](https://www.sanity.io/blog/how-we-solved-the-agent-memory-problem) — Simen Svale at Sanity on the Nuum memory architecture: three-tier storage, distillation not summarization, recursive compression. The foundation this plugin is built on.
 - [Mastra Observational Memory](https://mastra.ai/research/observational-memory) — the observer/reflector architecture and the switch from structured JSON to timestamped observation logs that made v2 work.
 - [Mastra Memory source](https://github.com/mastra-ai/mastra/tree/main/packages/memory) — reference implementation.
-- [LongMemEval](https://arxiv.org/abs/2410.10813) — the evaluation benchmark (ICLR 2025) we used to measure progress.
 - [OpenCode](https://opencode.ai) — the coding agent this plugin extends.
 
 ## License
```
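The coding-session eval's default mode, per the README text retained above, simulates compaction of early messages plus an 80K-token tail window. A minimal sketch of what a tail window means, in the repo's own TypeScript; the `Message` type and `tailWindow` function here are hypothetical illustrations, not the plugin's actual code:

```typescript
// Hypothetical sketch of a tail-window truncation: keep only the most
// recent messages whose combined token counts fit within a budget.
interface Message {
  role: "user" | "assistant";
  content: string;
  tokens: number; // assumed pre-computed per message
}

function tailWindow(messages: Message[], budget = 80_000): Message[] {
  const kept: Message[] = [];
  let used = 0;
  // Walk backwards from the newest message until the budget is exhausted.
  for (let i = messages.length - 1; i >= 0; i--) {
    if (used + messages[i].tokens > budget) break;
    kept.unshift(messages[i]);
    used += messages[i].tokens;
  }
  return kept;
}
```

Anything older than the window is simply gone, which is why the eval pairs this mode with compaction and why Lore's `recall` tool searches the raw history instead.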

eval/evaluate.ts

Lines changed: 0 additions & 169 deletions
This file was deleted.

eval/evaluation/evaluate_qa.py

Lines changed: 0 additions & 134 deletions
This file was deleted.
