fix: shared model server + tier switch bug fixes #343

Merged
buildingjoshbetter merged 5 commits into main from fix/resource-efficiency-v1 on May 16, 2026

Conversation

@buildingjoshbetter
Owner

Summary

Test plan

  • All 600 tests pass (11 deselected are pre-existing failures on main in stop_hook_safety + spawn_gate)
  • Zero ruff lint errors
  • Each change reviewed by 3-model adversarial panel (Gemini 2.5 Pro, Grok 4.1, Qwen3 235B)
  • Manual test: start MCP server, verify model_server.pid appears, run search, verify single model process
  • Manual test: kill model server, verify next search falls back to local loading
  • Manual test: tier switch with delta rebuild after fix (verify last_embedded_id preserved)

VectorCacheRegistry.set() was called without last_embedded_id, resetting
it to 0 after every rebuild. This broke delta rebuilds since the system
would think no messages had been embedded. Now queries the vec table for
the actual MAX(rowid) and COUNT(*) before finalizing.
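
The lookup described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: the helper name `vec_table_stats` and the table `vec_messages_demo` are hypothetical, and a plain SQLite table stands in for the real vec table.

```python
import sqlite3
from typing import Tuple


def vec_table_stats(conn: sqlite3.Connection, table: str) -> Tuple[int, int]:
    """Return (last_embedded_id, vector_count) for a vec table (hypothetical helper)."""
    # Table names cannot be bound as SQL parameters, so validate before interpolating.
    if not table.isidentifier():
        raise ValueError(f"unsafe table name: {table!r}")
    row = conn.execute(
        f"SELECT COALESCE(MAX(rowid), 0), COUNT(*) FROM {table}"
    ).fetchone()
    return int(row[0]), int(row[1])


# Demo data: a plain table stands in for the real vec table (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vec_messages_demo (embedding BLOB)")
conn.executemany(
    "INSERT INTO vec_messages_demo (rowid, embedding) VALUES (?, ?)",
    [(1, b"a"), (2, b"b"), (7, b"c")],
)
last_embedded_id, vector_count = vec_table_stats(conn, "vec_messages_demo")
```

Passing both values to the registry at finalize time would let a later delta rebuild resume after `last_embedded_id` instead of starting from 0.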

migrate_legacy_vec_tables() was using the currently-active tier group,
which could put edge-model vectors into basepro tables. Now detects the
actual model from metadata (defaulting to edge for pre-tier-switch DBs)
and uses VectorCacheRegistry.set() for proper registration.
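
A rough sketch of the tier detection described above, assuming the model is recorded under a metadata key. The key name `embedding_tier` and the helper `resolve_legacy_tier` are hypothetical; only the "default to edge" rule comes from the commit message.

```python
from typing import Mapping

# Pre-tier-switch databases carry no tier metadata; the commit defaults them to edge.
DEFAULT_LEGACY_TIER = "edge"


def resolve_legacy_tier(metadata: Mapping[str, str]) -> str:
    """Pick the tier for legacy vectors from recorded metadata (hypothetical helper).

    The currently active tier is deliberately not an input: using it is what
    let edge-model vectors land in another tier's tables.
    """
    recorded = metadata.get("embedding_tier")  # hypothetical metadata key
    return recorded if recorded else DEFAULT_LEGACY_TIER
```

With this shape, an empty metadata mapping resolves to `edge`, while a database that recorded its tier keeps that tier regardless of what is active at migration time.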

Replace triple-sampling + adaptive sleeping with a single RAM check
per batch. Old behavior: 7s of overhead per batch (5s triple-sample +
2s adaptive sleep) for 1s of work. New behavior: a 0.2s pause, plus a
5s pause only if RAM < 2GB. Batch sizes are fixed at init based on
total RAM and device type. OOM recovery is still handled by the worker
(halve and retry).
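
The per-batch throttle and the halve-and-retry recovery might look roughly like this. It is a sketch under stated assumptions: the function names are hypothetical, the 5s is read as being added on top of the 0.2s pause, the RAM figure is passed in by the caller, and `MemoryError` stands in for whatever OOM error the real worker catches.

```python
from typing import Callable, List, Sequence

LOW_RAM_BYTES = 2 * 1024**3      # the "RAM < 2GB" threshold from the commit message
BASE_PAUSE_S = 0.2               # pause applied after every batch
LOW_RAM_EXTRA_PAUSE_S = 5.0      # extra pause only when RAM is low (assumed additive)


def pause_after_batch(ram_bytes: int) -> float:
    """Seconds to pause after a batch, based on one RAM reading."""
    if ram_bytes < LOW_RAM_BYTES:
        return BASE_PAUSE_S + LOW_RAM_EXTRA_PAUSE_S
    return BASE_PAUSE_S


def embed_with_halving(
    embed: Callable[[Sequence[str]], Sequence], batch: Sequence[str]
) -> List:
    """Embed a batch; on an out-of-memory error, split it in half and retry each half."""
    try:
        return list(embed(batch))
    except MemoryError:
        if len(batch) <= 1:
            raise  # a single item that does not fit cannot be split further
        mid = len(batch) // 2
        return embed_with_halving(embed, batch[:mid]) + embed_with_halving(
            embed, batch[mid:]
        )
```

A worker loop would call `time.sleep(pause_after_batch(ram))` once per batch, which replaces the three samples and adaptive sleep with a single reading.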

Models were preloaded eagerly on startup, then the idle timer would
unload them 5 minutes later if no search happened, wasting the initial
load. Now lazy-loads by default (models load on first search). Users
who want eager preloading can set TRUEMEMORY_PRELOAD_MODELS=1.
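
One way to picture the lazy-by-default behavior, with the opt-in env var. Only `TRUEMEMORY_PRELOAD_MODELS=1` comes from the commit message; `LazyModel` and `preload_requested` are hypothetical names used for illustration.

```python
import os
from typing import Any, Callable, Mapping, Optional


def preload_requested(env: Mapping[str, str] = os.environ) -> bool:
    """True only when the user opted into eager loading."""
    return env.get("TRUEMEMORY_PRELOAD_MODELS") == "1"


class LazyModel:
    """Defers an expensive load until first use (hypothetical wrapper)."""

    def __init__(self, loader: Callable[[], Any], preload: bool = False) -> None:
        self._loader = loader
        self._model: Optional[Any] = None
        if preload:
            self.get()  # eager path, taken only on explicit opt-in

    @property
    def loaded(self) -> bool:
        return self._model is not None

    def get(self) -> Any:
        # The first call pays the load cost; later calls reuse the instance.
        if self._model is None:
            self._model = self._loader()
        return self._model
```

Constructing `LazyModel(loader, preload=preload_requested())` at startup costs nothing unless the env var is set, so an idle process never loads a model it would unload minutes later.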
#335)

Add a standalone model server process that loads the embedding model and
reranker ONCE, serving all TrueMemory processes (MCP server, ingest
hooks, CLI) over a Unix domain socket. Reduces memory from ~10GB (5
processes x 2GB each) to ~2.5GB (1 server + 5 lightweight clients).

- truememory/model_server.py: UDS listener, lazy model loading, idle
  timeout auto-shutdown, PID lifecycle management
- truememory/model_client.py: EmbeddingProxy/RerankerProxy drop-in
  replacements, auto-start logic, transparent fallback to local loading
- Integration: get_model() and get_reranker() use server when available,
  fall back to local loading when server isn't running (e.g., in tests)
- MCP server startup calls ensure_server_running() to launch the server
- Set TRUEMEMORY_NO_MODEL_SERVER=1 to force local loading
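
The client-side fallback listed above can be sketched as below. This is illustrative only: the newline-delimited JSON protocol, the function names, and the placeholder local embedder are assumptions, not the wire format or API of `model_client.py`. Only the Unix domain socket transport, the fallback to local loading, and `TRUEMEMORY_NO_MODEL_SERVER=1` come from the description.

```python
import json
import os
import socket
from typing import List, Mapping, Sequence


def embed_locally(texts: Sequence[str]) -> List[List[float]]:
    """Placeholder for in-process model loading (assumption for the demo)."""
    return [[float(len(t))] for t in texts]


def embed_via_server(
    sock_path: str, texts: Sequence[str], timeout: float = 2.0
) -> List[List[float]]:
    """Send one request over a Unix domain socket and read one JSON line back."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        s.connect(sock_path)
        s.sendall(json.dumps({"texts": list(texts)}).encode() + b"\n")
        with s.makefile("rb") as f:
            line = f.readline()
    return json.loads(line)["vectors"]


def embed(
    texts: Sequence[str], sock_path: str, env: Mapping[str, str] = os.environ
) -> List[List[float]]:
    """Prefer the shared server; fall back to local loading if it is unavailable."""
    if env.get("TRUEMEMORY_NO_MODEL_SERVER") == "1":
        return embed_locally(texts)
    try:
        return embed_via_server(sock_path, texts)
    except (OSError, ValueError, KeyError):
        # Missing socket, refused connection, timeout, or a malformed reply.
        return embed_locally(texts)


def serve_once(listener: socket.socket) -> None:
    """Toy server side: answer a single request with marker vectors."""
    conn, _ = listener.accept()
    with conn, conn.makefile("rb") as f:
        request = json.loads(f.readline())
        vectors = [[-1.0] for _ in request["texts"]]
        conn.sendall(json.dumps({"vectors": vectors}).encode() + b"\n")
```

Calling `embed(["hello"], path)` with no listener at `path` returns the local result, which mirrors the "kill model server, verify next search falls back" manual test.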