feat(memory): persisted go-vector episode index + featurization quality#12
Merged
Conversation
Fixes the two remaining memory weaknesses using only the existing go-vector library — no new embedding dependency. **#4 — per-turn LLM call eliminated.** FormatEpisodeContext previously called episodes.Search → NewLLMRanker → one llm.SimpleCall per turn. The RP fallback was no better at scale: NewRPRanker re-instantiated RandomProjections and re-fit + re-embedded every episode on every Search call (no caching). New episodeVectorIndex (mirroring session/vector_index.go) persists a go-vector Store + RandomProjections embedder to gob files. FormatEpisodeContext now calls episodes.recallByVector → sharedEpisodeIndex.search → embed query + N cached cosines. Zero LLM calls on the per-turn path. Design: dirty-flag + full rebuild. RP must be Fit on the full corpus to produce a valid vocabulary — incremental embedding after a stale Fit yields degenerate vectors. Episodes are written at session-end (infrequent); rebuild is triggered on the next search after any write, then cached for all subsequent turns until the next write. One O(n) rebuild per session-end, then µs cosine per turn. Process-wide singleton per memory directory (sharedEpisodeIndex, mirroring factsDirLock) prevents concurrent serve.go per-connection managers from racing on the gob files. SearchEpisodes (explicit memory tool) now fetches candidates from the vector index and LLM-reranks only those candidates (bounded), fixing the O(n)·LLM cost in the explicit search path too. llm_search now gates explicit search reranking only — per-turn recall always uses RP. **#3 — featurization quality.** New featurize.go: normalizeForEmbedding (lowercase + alphanumeric tokens, strips punctuation so "Postgres," == "postgres") + featurizeForEmbedding (normalise + bigrams "w1_w2" for light local word order). Applied at the go-vector boundary in both the episode index (full ~1KB on-disk summaries, up from 120-char truncated; 256 dims, up from 64) and MergeDetector.Fit/ Classify/AppendEntry/ReplaceEntry. Raw strings are preserved in m.corpus; only the RP boundary uses featurized text. Verified: FormatEpisodeContext fires zero LLM calls; postgres vs mysql episodes rank distinctly; postgres vs "database is postgres" ranks similar; persistence round-trips; dirty rebuild picks up new episodes; provenance filter holds; concurrent safety under -race; all existing tests green. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…4/D-05/D-06) Remediates the AI Verification Protocol findings on PR #12. D-05 (OOV zero-score bypass): when a query has no vocabulary overlap with the episode corpus, go-vector Embed returns a zero vector and Store.Search returns k results all with cosine similarity=0. recallByVector was returning those as non-empty candidates, so SearchEpisodes skipped the LLM fallback with noise. Fix: filter zero-score results in recallByVector before returning; an all-OOV query now returns nil, correctly triggering the fallback path in SearchEpisodes. D-06 (SearchEpisodes at 52.9% coverage): four branches untested. Added: llm_search=false (no LLM), nil LLM client, limit truncation, rankFn error fallback, and OOV query triggering the nil-then-fallback path. Coverage now 76.5%. D-04 (multi-process caveat undocumented): the in-process singleton serializes concurrent MemoryManagers within one process, but two separate odek processes sharing ~/.odek/memory are not serialized. Added explicit documentation to the singleton var block explaining the limitation and its bounded impact. D-01/D-02/D-03/D-07/D-08/D-09 confirmed held (no fix needed): double-checked locking is correct; per-TempDir tests prevent singleton bleed; corrupted emb gob falls back to rebuild; featurization is symmetric across all Fit/Embed paths. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Contributor
Author
Verification Protocol (v5.2.7) — Remediation pass (commit
|
| Finding | Sev | Status |
|---|---|---|
| D-05 OOV query returns zero-score noise → LLM fallback bypassed | WARN | ✅ Fixed — recallByVector now strips scores ≤ 0 before returning; all-OOV query returns nil, correctly triggering fallback in SearchEpisodes. |
D-06 SearchEpisodes at 52.9% coverage (4 branches untested) |
WARN | ✅ Fixed — added llm_search=false, nil-LLM, limit-truncation, rankFn-error-fallback, and OOV-triggers-nil-then-fallback tests. Coverage: 52.9% → 76.5%. |
| D-04 Multi-process gob contention undocumented | WARN | ✅ Documented — added explicit note to the singleton var block explaining the in-process vs cross-process serialization boundary and its bounded impact. |
| D-01 TOCTOU double-checked locking | info | ✅ Confirmed correct — second waiter always hits double-check and returns. |
| D-02 Singleton bleed across tests | info | ✅ Confirmed safe — each test uses unique t.TempDir(). |
| D-03 Mismatched gobs on partial write | info | ✅ Self-heals — tryLoadLocked falls through to rebuild if emb gob is corrupt; verified with injected corruption. |
| D-07/08/09 | info | ✅ All held. |
Re-scored axes
2.1 ✅ (OOV fix) · 2.2 ✅ (branches covered) · 2.3 ✅ · 2.4 ✅ (documented) · 2.5 ✅ · 2.6 ✅ · 2.7
Verdict: 🟡 HumanReviewRequired — monoculture only. No axis is 🔴 and no substantive finding remains. The HumanReviewRequired floor is the monoculture ρ penalty (same model authored and verified). All content findings are fixed.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes weaknesses #3 and #4 from the memory design review using only the existing go-vector library — no new embedding dependency.
Problem
#4 — LLM call on every turn.
FormatEpisodeContextfires on every agent loop turn. With the defaultllm_search:trueit calledepisodes.Search→NewLLMRanker→ onellm.SimpleCallper turn. The RP fallback was no better at scale:NewRPRankerre-instantiatedRandomProjectionsand re-fit + re-embedded every episode on every call — no caching at all.#3 — RP quality ceiling. Episodes were embedded from the 120-char truncated index summary (not the full ~1 KB on-disk summary), at 64 dims while everything else used 256. Text was fed raw to go-vector, so
"Postgres,"and"postgres"were different tokens.Solution
Persisted go-vector episode index (episode_index.go)
Mirrors
internal/session/vector_index.go:vector.Store(brute-force k-NN) +RandomProjections(256 dims) persisted as gob files.Fiton the full corpus to produce a valid vocabulary; incremental embedding with a staleFityields degenerate vectors. Since writes happen at session-end (infrequent) and recall fires every turn (frequent), the right trade-off is:markDirty()on write (no I/O), then rebuild from all on-disk summaries on the nextsearch()call. One O(n) rebuild per new episode, then µs cosine per turn.sharedEpisodeIndex, mirroringfactsDirLock) so the per-connectionMemoryManagers thatodek servebuilds all share one index and don't race on the gob files.Per-turn recall is now zero-LLM
FormatEpisodeContext→episodes.recallByVector→sharedEpisodeIndex.search→ embed query + N cached cosines. No LLM call on the hot path, verified by test.Explicit search: bounded LLM rerank
SearchEpisodes(thememorytool) now fetches candidates from the vector index first, then LLM-reranks only those candidates — fixing both theO(n)·LLMscaling problem and keeping relevance quality.llm_searchnow gates explicit-search reranking only; per-turn recall ignores it.Featurization quality (featurize.go)
normalizeForEmbedding: lowercase + alphanumeric-only tokens (strips punctuation so"Postgres,"=="postgres").featurizeForEmbedding: normalise + append bigrams ("w1_w2") for local word-order signal."uses postgres"and"uses mysql"share the"uses"unigram but differ on bigrams; RP now discriminates them.MergeDetector(facts). Raw strings kept inm.corpus; only RP sees featurized text.Verification
gofmt/build/vetclean.go test -race ./internal/memory/...passes — all prior tests green + new tests:TestEpisodeIndex_FormatEpisodeContextNoLLM)Initloads gobs, same resultsmarkDirtyrecallByVector-race: N goroutines sharing one dir"uses postgres"vs"migrated to mysql"→add; vs"database is postgres"→merge/judge🤖 Generated with Claude Code