eval: remove LongMemEval harness, data references, and old result files
LongMemEval was useful for early development, but the coding session
eval is a better measure of real-world value. Removed:
- eval/harness.ts, eval/evaluate.ts (LongMemEval harnesses)
- eval/evaluation/ (Python scoring scripts)
- 22 old result files (baseline_oracle, nuum_oracle, etc.)
- LongMemEval section from README benchmarks
- longmemeval gitignore pattern
README.md (1 addition, 17 deletions)
@@ -24,21 +24,6 @@ A **gradient context manager** decides how much of each tier to include in each
 
 > Scores below are on Claude Sonnet 4 (claude-sonnet-4-6). Results may vary with other models.
 
-### General memory recall
-
-500-question evaluation using the [LongMemEval](https://github.com/xiaowu0162/LongMemEval) benchmark (ICLR 2025), tested in oracle mode (full message history provided as conversation context).
 
 20 questions across 2 real coding sessions (113K and 353K tokens), targeting specific facts at varying depths. Default mode simulates OpenCode's actual behavior: compaction of early messages + 80K-token tail window. Lore mode uses on-the-fly distillation + the `recall` tool for searching raw message history.
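As a rough illustration of what one of those depth-targeted recall questions could look like, here is a minimal TypeScript sketch. The interface, field names, and scoring below are hypothetical, not the harness's actual schema:

```typescript
// Hypothetical shape for one depth-targeted recall question.
interface RecallQuestion {
  question: string;
  expectedFact: string; // exact string the answer must contain
  depth: number;        // 0 = start of session, 1 = end; early facts are
                        // the ones compaction tends to drop
}

const sample: RecallQuestion = {
  question: "How many retries does the HTTP client use?",
  expectedFact: "3",
  depth: 0.1, // early in the transcript
};

// Naive substring scoring; a real harness would more likely use an
// LLM judge or normalized matching.
function score(answer: string, q: RecallQuestion): boolean {
  return answer.includes(q.expectedFact);
}

console.log(score("The client retries 3 times.", sample)); // true
```

A default-mode run would answer each question with only the compacted context plus the 80K-token tail, so low-`depth` questions are the ones it tends to miss.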
@@ -87,7 +72,7 @@ This plugin was built in a few intense sessions. Some highlights:
 
 **Markdown injection.** Property-based testing with fast-check revealed that user-generated content in facts (code fences, heading markers, thematic breaks) could break the markdown structure of the injected context, confusing the model.
 
-**v2 — observation logs.** Switching to Mastra's observer/reflector architecture with plain-text timestamped observation logs was the breakthrough — LongMemEval jumped from 73.8% to 88.0%. The key insight: dated event logs preserve temporal relationships that structured JSON destroys.
+**v2 — observation logs.** Switching to Mastra's observer/reflector architecture with plain-text timestamped observation logs was the breakthrough. The key insight: dated event logs preserve temporal relationships that structured JSON destroys.
 
 **Prompt refinements.** The push from 80% to 93.3% on the initial coding recall eval came from two observer prompt additions: "EXACT NUMBERS — NEVER APPROXIMATE" (the observer was rounding counts) and "BUG FIXES — ALWAYS RECORD" (early-session fixes were being compressed away during reflection).
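The "dated event logs vs. structured JSON" insight above can be sketched in a few lines of TypeScript. The log format shown is an assumption for illustration, not the plugin's actual layout:

```typescript
// A dated, plain-text observation log: one line per event, ordered by time.
type Observation = { ts: string; text: string };

function renderLog(obs: Observation[]): string {
  // Ordering and timestamps survive verbatim in the rendered text,
  // so the model sees *when* each fact became true, not just that it did.
  return obs.map((o) => `[${o.ts}] ${o.text}`).join("\n");
}

const observations: Observation[] = [
  { ts: "2025-01-10 09:14", text: "Renamed config key db_url to database_url" },
  { ts: "2025-01-10 09:30", text: "Bug fix: retry loop was off by one" },
];

console.log(renderLog(observations));
```

Serializing the same events as an unordered JSON object of facts would keep the facts but discard the before/after relationship between them.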
@@ -157,7 +142,6 @@ The assistant gets a `recall` tool that searches across stored messages and kno
 
 - [How we solved the agent memory problem](https://www.sanity.io/blog/how-we-solved-the-agent-memory-problem) — Simen Svale at Sanity on the Nuum memory architecture: three-tier storage, distillation not summarization, recursive compression. The foundation this plugin is built on.
 - [Mastra Observational Memory](https://mastra.ai/research/observational-memory) — the observer/reflector architecture and the switch from structured JSON to timestamped observation logs that made v2 work.
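For context on what a recall-style search over stored messages might look like, here is a minimal keyword-scoring sketch in TypeScript. The `recall` signature and ranking are hypothetical simplifications, not the plugin's actual tool:

```typescript
// Minimal sketch of searching raw message history by keyword overlap.
interface StoredMessage {
  ts: string;
  role: "user" | "assistant";
  text: string;
}

function recall(history: StoredMessage[], query: string, limit = 3): StoredMessage[] {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  return history
    .map((m) => ({
      m,
      // score = number of query terms appearing in the message text
      hits: terms.filter((t) => m.text.toLowerCase().includes(t)).length,
    }))
    .filter((s) => s.hits > 0)
    .sort((a, b) => b.hits - a.hits)
    .slice(0, limit)
    .map((s) => s.m);
}

const history: StoredMessage[] = [
  { ts: "09:00", role: "user", text: "Set the retry count to 3" },
  { ts: "09:05", role: "assistant", text: "Renamed db_url to database_url" },
];

console.log(recall(history, "retry count").map((m) => m.text));
```

A production version would plausibly add ranking over embeddings or BM25 and return surrounding context windows rather than single messages, but the shape (query in, ranked raw messages out) is the same.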