Skip to content

feat(memory): persisted go-vector episode index + featurization quality#12

Merged
jkyberneees merged 2 commits into
mainfrom
feat/memory-vector-recall
Jun 6, 2026
Merged

feat(memory): persisted go-vector episode index + featurization quality#12
jkyberneees merged 2 commits into
mainfrom
feat/memory-vector-recall

Conversation

@jkyberneees
Copy link
Copy Markdown
Contributor

Summary

Fixes weaknesses #3 and #4 from the memory design review using only the existing go-vector library — no new embedding dependency.

Problem

#4 — LLM call on every turn. FormatEpisodeContext fires on every agent loop turn. With the default llm_search:true it called episodes.SearchNewLLMRanker → one llm.SimpleCall per turn. The RP fallback was no better at scale: NewRPRanker re-instantiated RandomProjections and re-fit + re-embedded every episode on every call — no caching at all.

#3 — RP quality ceiling. Episodes were embedded from the 120-char truncated index summary (not the full ~1 KB on-disk summary), at 64 dims while everything else used 256. Text was fed raw to go-vector, so "Postgres," and "postgres" were different tokens.

Solution

Persisted go-vector episode index (episode_index.go)

Mirrors internal/session/vector_index.go:

  • vector.Store (brute-force k-NN) + RandomProjections (256 dims) persisted as gob files.
  • Dirty-flag + full rebuild design. RP must be Fit on the full corpus to produce a valid vocabulary; incremental embedding with a stale Fit yields degenerate vectors. Since writes happen at session-end (infrequent) and recall fires every turn (frequent), the right trade-off is: markDirty() on write (no I/O), then rebuild from all on-disk summaries on the next search() call. One O(n) rebuild per new episode, then µs cosine per turn.
  • Process-wide singleton per memory dir (sharedEpisodeIndex, mirroring factsDirLock) so the per-connection MemoryManagers that odek serve builds all share one index and don't race on the gob files.
  • Reads full on-disk summaries for the rebuild corpus (not the 120-char index truncations).

Per-turn recall is now zero-LLM

FormatEpisodeContextepisodes.recallByVectorsharedEpisodeIndex.search → embed query + N cached cosines. No LLM call on the hot path, verified by test.

Explicit search: bounded LLM rerank

SearchEpisodes (the memory tool) now fetches candidates from the vector index first, then LLM-reranks only those candidates — fixing both the O(n)·LLM scaling problem and keeping relevance quality. llm_search now gates explicit-search reranking only; per-turn recall ignores it.

Featurization quality (featurize.go)

  • normalizeForEmbedding: lowercase + alphanumeric-only tokens (strips punctuation so "Postgres," == "postgres").
  • featurizeForEmbedding: normalise + append bigrams ("w1_w2") for local word-order signal. "uses postgres" and "uses mysql" share the "uses" unigram but differ on bigrams; RP now discriminates them.
  • Applied at the go-vector boundary in the episode index and MergeDetector (facts). Raw strings kept in m.corpus; only RP sees featurized text.

Verification

  • gofmt/build/vet clean.
  • go test -race ./internal/memory/... passes — all prior tests green + new tests:
    • Zero LLM calls on per-turn recall (TestEpisodeIndex_FormatEpisodeContextNoLLM)
    • Cold start: index rebuilds from disk, returns relevant episodes
    • Persistence: second Init loads gobs, same results
    • Dirty rebuild: new episode picked up after markDirty
    • Provenance filter: untrusted/unapproved excluded from recallByVector
    • Concurrent safety under -race: N goroutines sharing one dir
    • Featurization discrimination: "uses postgres" vs "migrated to mysql"add; vs "database is postgres"merge/judge
    • Featurize unit tests, absolute-path singleton identity

🤖 Generated with Claude Code

jkyberneees and others added 2 commits June 6, 2026 11:38
Fixes the two remaining memory weaknesses using only the existing go-vector
library — no new embedding dependency.

**#4 — per-turn LLM call eliminated.**
FormatEpisodeContext previously called episodes.Search → NewLLMRanker →
one llm.SimpleCall per turn. The RP fallback was no better at scale:
NewRPRanker re-instantiated RandomProjections and re-fit + re-embedded every
episode on every Search call (no caching).

New episodeVectorIndex (mirroring session/vector_index.go) persists a
go-vector Store + RandomProjections embedder to gob files. FormatEpisodeContext
now calls episodes.recallByVector → sharedEpisodeIndex.search → embed query +
N cached cosines. Zero LLM calls on the per-turn path.

Design: dirty-flag + full rebuild. RP must be Fit on the full corpus to produce
a valid vocabulary — incremental embedding after a stale Fit yields degenerate
vectors. Episodes are written at session-end (infrequent); rebuild is triggered
on the next search after any write, then cached for all subsequent turns until
the next write. One O(n) rebuild per session-end, then µs cosine per turn.

Process-wide singleton per memory directory (sharedEpisodeIndex, mirroring
factsDirLock) prevents concurrent serve.go per-connection managers from racing
on the gob files.

SearchEpisodes (explicit memory tool) now fetches candidates from the vector
index and LLM-reranks only those candidates (bounded), fixing the O(n)·LLM
cost in the explicit search path too. llm_search now gates explicit search
reranking only — per-turn recall always uses RP.

**#3 — featurization quality.**
New featurize.go: normalizeForEmbedding (lowercase + alphanumeric tokens,
strips punctuation so "Postgres," == "postgres") + featurizeForEmbedding
(normalise + bigrams "w1_w2" for light local word order). Applied at the
go-vector boundary in both the episode index (full ~1KB on-disk summaries,
up from 120-char truncated; 256 dims, up from 64) and MergeDetector.Fit/
Classify/AppendEntry/ReplaceEntry. Raw strings are preserved in m.corpus;
only the RP boundary uses featurized text.

Verified: FormatEpisodeContext fires zero LLM calls; postgres vs mysql episodes
rank distinctly; postgres vs "database is postgres" ranks similar; persistence
round-trips; dirty rebuild picks up new episodes; provenance filter holds;
concurrent safety under -race; all existing tests green.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…4/D-05/D-06)

Remediates the AI Verification Protocol findings on PR #12.

D-05 (OOV zero-score bypass): when a query has no vocabulary overlap with the
episode corpus, go-vector Embed returns a zero vector and Store.Search returns k
results all with cosine similarity=0. recallByVector was returning those as
non-empty candidates, so SearchEpisodes skipped the LLM fallback with noise.
Fix: filter zero-score results in recallByVector before returning; an all-OOV
query now returns nil, correctly triggering the fallback path in SearchEpisodes.

D-06 (SearchEpisodes at 52.9% coverage): four branches untested.
Added: llm_search=false (no LLM), nil LLM client, limit truncation, rankFn
error fallback, and OOV query triggering the nil-then-fallback path.
Coverage now 76.5%.

D-04 (multi-process caveat undocumented): the in-process singleton serializes
concurrent MemoryManagers within one process, but two separate odek processes
sharing ~/.odek/memory are not serialized. Added explicit documentation to the
singleton var block explaining the limitation and its bounded impact.

D-01/D-02/D-03/D-07/D-08/D-09 confirmed held (no fix needed): double-checked
locking is correct; per-TempDir tests prevent singleton bleed; corrupted emb gob
falls back to rebuild; featurization is symmetric across all Fit/Embed paths.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@jkyberneees
Copy link
Copy Markdown
Contributor Author

Verification Protocol (v5.2.7) — Remediation pass (commit f1e7309)

All certificate findings closed.

Finding Sev Status
D-05 OOV query returns zero-score noise → LLM fallback bypassed WARN ✅ Fixed — recallByVector now strips scores ≤ 0 before returning; all-OOV query returns nil, correctly triggering fallback in SearchEpisodes.
D-06 SearchEpisodes at 52.9% coverage (4 branches untested) WARN ✅ Fixed — added llm_search=false, nil-LLM, limit-truncation, rankFn-error-fallback, and OOV-triggers-nil-then-fallback tests. Coverage: 52.9% → 76.5%.
D-04 Multi-process gob contention undocumented WARN ✅ Documented — added explicit note to the singleton var block explaining the in-process vs cross-process serialization boundary and its bounded impact.
D-01 TOCTOU double-checked locking info ✅ Confirmed correct — second waiter always hits double-check and returns.
D-02 Singleton bleed across tests info ✅ Confirmed safe — each test uses unique t.TempDir().
D-03 Mismatched gobs on partial write info ✅ Self-heals — tryLoadLocked falls through to rebuild if emb gob is corrupt; verified with injected corruption.
D-07/08/09 info ✅ All held.

Re-scored axes

2.1 ✅ (OOV fix) · 2.2 ✅ (branches covered) · 2.3 ✅ · 2.4 ✅ (documented) · 2.5 ✅ · 2.6 ✅ · 2.7 ⚠️ (monoculture) · 2.8 ✅ · 2.9 ✅

Verdict: 🟡 HumanReviewRequired — monoculture only. No axis is 🔴 and no substantive finding remains. The HumanReviewRequired floor is the monoculture ρ penalty (same model authored and verified). All content findings are fixed.

@jkyberneees jkyberneees merged commit 686fb6f into main Jun 6, 2026
6 checks passed
@jkyberneees jkyberneees deleted the feat/memory-vector-recall branch June 6, 2026 10:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant