
fix(mcp): cold-start resilience — async handlers, reranker timeout, Windows portability #344

Open

Huntehhh wants to merge 8 commits into buildingjoshbetter:main from Huntehhh:fix/mcp-cold-start-resilience

Conversation

@Huntehhh (Contributor) commented May 16, 2026

Summary

Three compounding failure modes caused the MCP server to hang or crash on cold
start; this PR fixes all three. It also carries a sibling Windows-portability
fix in the pytest test decorators (same hasattr-guard theme as the WNOHANG
fix; that bug was blocking pytest collection on Windows entirely).

  1. Sync handlers — all 9 @mcp.tool() handlers were sync def, so one
    blocked handler froze every concurrent JSON-RPC request on FastMCP's
    single asyncio thread. Converted them to async def with engine calls
    wrapped in asyncio.to_thread (see the sketch after this list). @tracked
    is now async-aware so it doesn't re-wrap coroutines as sync functions.

  2. Reranker preload hangs — CrossEncoder(...) in _preload_models
    blocks indefinitely on a corrupt HuggingFace cache, a stalled download,
    or Windows Defender ASR denying a sentencepiece shim. Added a 30 s
    watchdog (TRUEMEMORY_RERANKER_TIMEOUT_SEC override); on timeout it marks
    the reranker degraded, and the rerank entrypoints fall back to
    original-ordering results. _set_reranker short-circuits when degraded so
    search calls don't block on the stalled load's lock. Degraded state
    surfaces in the F06 health payload (truememory_stats.health.reranker).

  3. os.WNOHANG is POSIX-only — _reap_children called
    os.waitpid(-1, os.WNOHANG), crashing every Windows user's backlog
    drainer with AttributeError on every boot. Guarded with hasattr.
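A minimal sketch of the item-1 conversion, assuming the MCP Python SDK's FastMCP decorator; the `_Engine` stub is a hypothetical stand-in for the real truememory engine:

```python
import asyncio

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("truememory")


class _Engine:
    """Hypothetical stand-in for the real truememory engine."""

    def search(self, query: str) -> list[dict]:
        ...  # blocking model inference + SQLite I/O happens here


engine = _Engine()


@mcp.tool()
async def truememory_search(query: str) -> list[dict]:
    # The blocking engine call runs on a worker thread, so FastMCP's
    # single event-loop thread stays free for concurrent JSON-RPC requests.
    return await asyncio.to_thread(engine.search, query)
```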

Also: engine.add() now pre-computes embeddings outside _write_lock so
concurrent stores overlap on inference instead of serializing.
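Roughly the shape of that add() change, assuming a sentence-transformers-style .encode(); the `_insert` helper is a hypothetical placeholder for the real INSERT path:

```python
import threading


class Engine:
    def __init__(self, model):
        self.model = model
        self._write_lock = threading.Lock()

    def add(self, content: str, separation_text: str) -> None:
        # Pre-compute both embeddings OUTSIDE the lock. PyTorch releases the
        # GIL inside .encode(), so concurrent add() calls overlap on the
        # ~10-50 ms inference instead of serializing behind the lock.
        content_vec = self.model.encode(content)
        separation_vec = self.model.encode(separation_text)

        # Only the microsecond-scale INSERTs contend on the lock.
        with self._write_lock:
            self._insert(content, content_vec, separation_vec)

    def _insert(self, content, content_vec, separation_vec) -> None:
        ...  # SQLite INSERTs (placeholder)
```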

Changes

| File | Change |
| --- | --- |
| `truememory/mcp_server.py` | 9 handlers → `async def` + `asyncio.to_thread`; reranker watchdog with `TRUEMEMORY_RERANKER_TIMEOUT_SEC` (default 30 s, validated); `_reap_children` POSIX guard; `_set_reranker` short-circuit on degraded; watchdog wires `_record_reranker_error` so the health payload reflects the truth; `_parse_reranker_timeout` helper |
| `truememory/reranker.py` | `_load_failed` flag; `is_degraded()` + `mark_degraded(reason)` public API; early return in `rerank()`, `rerank_with_fusion()`, `rerank_with_modality_fusion()` when degraded |
| `truememory/telemetry.py` | `@tracked` is now async-aware: detects `iscoroutinefunction` and wraps with `async_wrapper` so FastMCP sees the correct function type (sketch below this table) |
| `truememory/engine.py` | `add()` pre-computes content + separation embeddings BEFORE `_write_lock` — concurrent stores overlap on inference (~10-50 ms per encode, GIL released inside `.encode()`) |
| `tests/test_concurrent_store_hang.py` | New file (3 tests): engine concurrency, MCP handler async shape, `asyncio.gather` end-to-end. Includes `truememory_status` in the async-handler list |
| `tests/test_cold_start_resilience.py` | New file (14 tests): WNOHANG guard, degraded-flag lifecycle, watchdog timeout + fast-load non-regression, `_set_reranker` short-circuit, health-payload wiring on both timeout and exception paths, timeout-parser validation (positive / zero / negative / non-integer / unset) |
| `tests/test_health_stats.py` | Wraps the now-async `truememory_stats()` in `asyncio.run()` |
| `tests/ingest/test_robustness_fixes.py` | 4 `@pytest.mark.skipif` decorators: `os.geteuid() == 0` → `not hasattr(os, "geteuid") or os.geteuid() == 0`. Same POSIX-only-API pattern as WNOHANG; this was crashing pytest collection on Windows |
| `CHANGELOG.md` | New `## [Unreleased]` section documenting all of the above |
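The `truememory/telemetry.py` row hinges on one pattern: branch on `iscoroutinefunction` before wrapping. A minimal sketch (the real `@tracked` presumably records timings and metadata where the placeholder comments sit):

```python
import asyncio
import functools


def tracked(fn):
    """Telemetry wrapper that preserves the wrapped function's sync/async type.

    Wrapping an `async def` in a plain sync wrapper returns an unawaited
    coroutine object, and FastMCP then sees a sync function, silently
    defeating the async-handler conversion.
    """
    if asyncio.iscoroutinefunction(fn):
        @functools.wraps(fn)
        async def async_wrapper(*args, **kwargs):
            # record call start / emit metrics here
            return await fn(*args, **kwargs)
        return async_wrapper

    @functools.wraps(fn)
    def sync_wrapper(*args, **kwargs):
        # record call start / emit metrics here
        return fn(*args, **kwargs)
    return sync_wrapper
```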

Test Plan

  • python -m pytest tests/test_cold_start_resilience.py -v → 14 passed
  • python -m pytest tests/test_concurrent_store_hang.py -v → 3 passed
  • Cold start: truememory-mcp reaches "server ready" within 30 s, no boot hang
  • Windows: no AttributeError: module 'os' has no attribute 'WNOHANG' in logs
  • TRUEMEMORY_LAZY_MODELS=1 — server starts instantly, models load lazily on first search
  • Corrupt HF cache (rm -rf ~/.cache/huggingface/hub/models--BAAI--bge-reranker-v2-m3*): confirm "preload exceeded 30s" appears in logs AND truememory_stats.health.reranker.status == "degraded" AND truememory_search returns results (sans rerank)
  • TRUEMEMORY_RERANKER_TIMEOUT_SEC=0: WARNING logged at startup, watchdog still uses 30 s default

Design Notes

  • Why 30 s default? Cold load of BAAI/bge-reranker-v2-m3 on a typical
    machine with warm HF cache: ~3-8 s. First-time download over slow network:
    10-25 s. 30 s gives generous headroom while still surfacing real hangs.
  • Why no "disable watchdog" option? TRUEMEMORY_LAZY_MODELS=1 already
    covers the "I don't want preload" path. Allowing TIMEOUT_SEC=0 to mean
    "disable" is a footgun — a typo like TIMEOUT_SEC= in a shell script
    becomes 0 and silently disables the safety net. (See the validation
    sketch after these notes.)
  • Why not kill the orphaned reranker thread? Python threading has no
    safe kill primitive. The orphan thread sits at 0% CPU using ~5 MB until
    process exit. Documented limitation; restart recovers cleanly.
  • Why not retry the reranker load? Once degraded, recovery requires
    process restart (clears any cache state too). Auto-retry hides root
    causes (HF cache corruption, ASR rule, network) that the user needs to
    fix once.
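Pulling these notes together, a sketch of the watchdog plus env-var validation and degraded-flag fallback. The helper names mirror the PR (`mark_degraded`, `is_degraded`, `_parse_reranker_timeout`); the `load_fn` parameter and `_start_preload_watchdog` name are illustrative, and the `_record_reranker_error` health wiring is omitted:

```python
import logging
import os
import threading

log = logging.getLogger(__name__)
_DEFAULT_TIMEOUT_SEC = 30

# --- degraded-flag surface (reranker.py) ---------------------------------
_load_failed = False


def is_degraded() -> bool:
    return _load_failed


def mark_degraded(reason: str) -> None:
    global _load_failed
    if not _load_failed:
        log.warning("reranker degraded: %s", reason)  # log once
    _load_failed = True


def rerank(query: str, candidates: list) -> list:
    if is_degraded():
        return candidates  # original ordering; never touches the stalled load
    ...  # real CrossEncoder scoring would go here


# --- timeout validation + watchdog (mcp_server.py) ------------------------
def _parse_reranker_timeout() -> int:
    raw = os.environ.get("TRUEMEMORY_RERANKER_TIMEOUT_SEC")
    if raw is None:
        return _DEFAULT_TIMEOUT_SEC
    try:
        value = int(raw)
    except ValueError:
        value = 0  # non-integer falls through to the warning below
    if value <= 0:
        # `TRUEMEMORY_RERANKER_TIMEOUT_SEC=` in a shell script must not
        # become thread.join(timeout=0), which would mark degraded on
        # every boot; warn and use the default instead.
        log.warning("invalid TRUEMEMORY_RERANKER_TIMEOUT_SEC=%r; using %ds",
                    raw, _DEFAULT_TIMEOUT_SEC)
        return _DEFAULT_TIMEOUT_SEC
    return value


def _start_preload_watchdog(load_fn) -> None:
    loader = threading.Thread(target=load_fn, daemon=True)
    loader.start()

    def _watchdog() -> None:
        loader.join(timeout=_parse_reranker_timeout())
        if loader.is_alive():
            # The loader thread is orphaned here (Python has no safe thread
            # kill), but nothing blocks on it once the flag is set.
            mark_degraded("reranker preload exceeded timeout")

    threading.Thread(target=_watchdog, daemon=True).start()
```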

Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>

Merge ordering

Status: Currently CONFLICTING against origin/main after PR #343 (fix/resource-efficiency-v1) merged. A trivial rebase is needed before merge — the branch was cut from 9d18019; main is now at 6ec0be5. I'll push the rebase the moment Hunter or a reviewer gives the go.

Blocks: #351 (Windows subprocess portability — mcp_server.py:1181 backlog drainer Popen layers on top of this PR's async-handler conversion).

Closes: #318 — Hunter's May-14 commit 5747b99 is the cherry-pick base of this PR's 68989d0. This PR is a strict superset (adds truememory_status async fix, Windows WNOHANG guard, reranker watchdog + degraded mode + _set_reranker short-circuit + health-payload wiring, regression tests, os.geteuid test fix, CHANGELOG).

Recommended merge sequence (10 PRs from a 3-agent coordination — see ~/.claude/handoffs/truememory-coordination-registry.md on Hunter's machine):

#353 (CI runner) → #344 (this) → #345 → #346 → #351 → #348 → #352 → #347 → #349 → #350

Huntehhh and others added 8 commits May 16, 2026 16:24
…ore hang

Resolves the 10-15s harness hang when 3+ truememory_store or search MCP
calls fire in parallel. Three layered changes:

1. mcp_server.py — 7 hot-path @mcp.tool() handlers (store / search /
   search_deep / get / forget / stats / entity_profile) changed from
   sync `def` to `async def`. Engine calls run via
   `await asyncio.to_thread(...)` so FastMCP's event-loop thread stays
   free for concurrent JSON-RPC requests. truememory_configure stays
   sync — heavy state mutation, called once at setup.

2. telemetry.py — `@tracked` is now async-aware. Wrapping an `async def`
   in the old sync wrapper produced an unawaited coroutine object that
   silently defeated the async-ification.

3. engine.py — `add()` pre-computes both content + separation embeddings
   OUTSIDE `_write_lock`. Previously the lock was held during the two
   ~10-50ms model.encode() calls, serializing all concurrent stores.
   PyTorch releases the GIL inside .encode(), so concurrent stores can
   now overlap on inference; they only contend at the INSERTs (μs).

Tests:
- tests/test_concurrent_store_hang.py (new): three regression locks —
  threaded engine.add(), MCP handler-shape check, asyncio.gather()
  end-to-end.
- tests/test_health_stats.py: wrap the now-async truememory_stats() in
  asyncio.run().

Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>
The May 14 async-handler commit (5747b99) didn't cover truememory_status
because it didn't exist yet at that base. status calls
RebuildManager.get_status which opens a SQLite connection — blocking I/O
that should run via asyncio.to_thread like the other async handlers.
Otherwise polling during a rebuild would block the event loop.

Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>
os.WNOHANG is POSIX-only; on Windows, os has no WNOHANG attribute and the
backlog drainer thread crashes with AttributeError on every boot for every
Windows user. Add a hasattr guard at the top of _reap_children so it
returns cleanly on platforms without the syscall.

Windows doesn't need zombie reaping anyway — terminated children release
their PID immediately, so there are no defunct processes to clean up.

Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>
A stalled reranker preload (corrupt HuggingFace cache, blocked download,
Windows Defender ASR denying a sentencepiece shim) currently leaves the
preload thread alive forever. Every subsequent search call then blocks
inside get_reranker() waiting on the same stalled load — which is why a
"frozen MCP" session always also has stats/search/forget hanging.

Three coordinated changes:

1. reranker.py — module-level _load_failed flag, is_degraded() reader,
   mark_degraded(reason) setter (logs once). rerank / rerank_with_fusion /
   rerank_with_modality_fusion check the flag at entry and return the
   original candidate ordering rather than calling get_reranker.

2. mcp_server.py _preload_models — _load_reranker now also calls
   mark_degraded on exception. A new watchdog thread joins on the loader
   for TRUEMEMORY_RERANKER_TIMEOUT_SEC (default 30s) and marks degraded
   if still alive. The loader thread is orphaned (Python can't safely
   kill threads) but no caller blocks on it.

3. mcp_server.py — new _RERANKER_LOAD_TIMEOUT_SEC constant near the
   existing _MODEL_IDLE_TIMEOUT_SEC for style consistency.

Result: cold start emits either a successful reranker load OR a single
WARNING line within the timeout window, and the server stays responsive
in degraded mode (non-reranked search). On next process restart the
degraded flag clears and load is retried.

Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>
- test_cold_start_resilience.py (new): pins down the 4 fixes from this PR.
  WNOHANG guard returns cleanly when os has no WNOHANG. is_degraded starts
  False, mark_degraded sets it. rerank / rerank_with_modality_fusion return
  original ordering when degraded (proves get_reranker is never called).
  Preload watchdog marks degraded on a stalled load and does NOT mark on a
  fast load (no false-positive regression).

- test_concurrent_store_hang.py: extend the async-handler regression list
  to include truememory_status, which the May 14 commit predated.

Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>
Four test decorators called os.geteuid() at module-import time to skip
when running as POSIX root. A sister bug to the WNOHANG one — geteuid is
POSIX-only, so on Windows the import itself raised AttributeError and
pytest could not even collect the file. Symptom: the full test suite
errored at collection on every Windows contributor's machine.

Wrap each decorator with a hasattr check so the test skips cleanly on
Windows (where chmod 555 / chmod 000 don't enforce read-only the same
way on NTFS as on POSIX anyway).

Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>
…out validation

Three follow-on fixes on top of the watchdog/degraded-flag commit so the
cold-start fix actually delivers in production conditions:

1. _set_reranker now short-circuits when reranker.is_degraded(). Without
   this, every truememory_search / truememory_search_deep call would still
   block — not on the event loop (the async-handler fix protects that) but
   on the reranker._lock that the stalled preload thread is holding. So
   the thread pool serializes instead, eating slots until exhausted.

2. The watchdog and the preload exception path now both call
   _record_reranker_error in addition to mark_degraded, so the F06 health
   payload (truememory_stats.health.reranker) reflects the degraded state
   immediately. Without this, operators would see status="ok" while search
   is silently falling back.

3. TRUEMEMORY_RERANKER_TIMEOUT_SEC is now parsed through
   _parse_reranker_timeout — non-positive and non-integer values fall back
   to the default with a warning. A typo like an empty assignment
   (TRUEMEMORY_RERANKER_TIMEOUT_SEC= in a shell script, which parses as 0)
   used to make thread.join(timeout=0) return immediately, marking degraded
   on every boot. The legitimate "skip preload" path is TRUEMEMORY_LAZY_MODELS=1.

Includes regression tests for all three paths.

Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>
Documents the seven commits of this PR series — async-handler conversion,
status async fix, WNOHANG guard, reranker watchdog + degraded fallback,
test portability, _set_reranker short-circuit, health-payload wiring, and
TIMEOUT_SEC validation — plus the new public surface
(TRUEMEMORY_RERANKER_TIMEOUT_SEC env var, reranker.is_degraded /
mark_degraded helpers, two regression test files).

Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>