
fix(mcp): cold-start resilience — async handlers, reranker timeout, Windows portability #344

Open

Huntehhh wants to merge 8 commits into buildingjoshbetter:main from Huntehhh:fix/mcp-cold-start-resilience

Conversation

@Huntehhh (Contributor) commented May 16, 2026

Summary

Three compounding failure modes caused the MCP server to hang or crash on cold
start; this PR fixes all three. It also carries a sibling Windows-portability
fix in the pytest test decorators (same hasattr-guard theme as the WNOHANG
fix; that bug was blocking pytest collection on Windows entirely).

  1. Sync handlers — all 9 @mcp.tool() handlers were sync def, so one
    blocked handler froze every concurrent JSON-RPC request on FastMCP's
    single asyncio thread. Converted them to async def with engine calls
    wrapped in asyncio.to_thread (see the sketch after this list). @tracked
    is now async-aware so it doesn't re-wrap coroutines as sync functions.

  2. Reranker preload hangs — CrossEncoder(...) in _preload_models
    blocks indefinitely on a corrupt HuggingFace cache, a stalled download,
    or Windows Defender ASR denying a sentencepiece shim. Added a 30 s
    watchdog (TRUEMEMORY_RERANKER_TIMEOUT_SEC override); on timeout it marks
    the reranker degraded, and the rerank entrypoints fall back to
    original-ordering results. _set_reranker short-circuits when degraded so
    search calls don't block on the stalled load's lock. Degraded state
    surfaces in the F06 health payload (truememory_stats.health.reranker).

  3. os.WNOHANG is POSIX-only — _reap_children called
    os.waitpid(-1, os.WNOHANG), crashing every Windows user's backlog
    drainer with AttributeError on every boot. Guarded with hasattr.
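A minimal sketch of the item-1 conversion, assuming the MCP Python SDK's FastMCP decorator; the `_Engine` stub is a hypothetical stand-in for the real truememory engine:

```python
import asyncio

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("truememory")


class _Engine:
    """Hypothetical stand-in for the real truememory engine."""

    def search(self, query: str) -> list[dict]:
        ...  # blocking model inference + SQLite I/O happens here


engine = _Engine()


@mcp.tool()
async def truememory_search(query: str) -> list[dict]:
    # The blocking engine call runs on a worker thread, so FastMCP's
    # single event-loop thread stays free for concurrent JSON-RPC requests.
    return await asyncio.to_thread(engine.search, query)
```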

Also: engine.add() now pre-computes embeddings outside _write_lock so
concurrent stores overlap on inference instead of serializing.
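Roughly the shape of that add() change, assuming a sentence-transformers-style .encode(); the `_insert` helper is a hypothetical placeholder for the real INSERT path:

```python
import threading


class Engine:
    def __init__(self, model):
        self.model = model
        self._write_lock = threading.Lock()

    def add(self, content: str, separation_text: str) -> None:
        # Pre-compute both embeddings OUTSIDE the lock. PyTorch releases the
        # GIL inside .encode(), so concurrent add() calls overlap on the
        # ~10-50 ms inference instead of serializing behind the lock.
        content_vec = self.model.encode(content)
        separation_vec = self.model.encode(separation_text)

        # Only the microsecond-scale INSERTs contend on the lock.
        with self._write_lock:
            self._insert(content, content_vec, separation_vec)

    def _insert(self, content, content_vec, separation_vec) -> None:
        ...  # SQLite INSERTs (placeholder)
```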

Changes

| File | Change |
| --- | --- |
| `truememory/mcp_server.py` | 9 handlers → `async def` + `asyncio.to_thread`; reranker watchdog with `TRUEMEMORY_RERANKER_TIMEOUT_SEC` (default 30 s, validated); `_reap_children` POSIX guard; `_set_reranker` short-circuit on degraded; watchdog wires `_record_reranker_error` so the health payload reflects the truth; `_parse_reranker_timeout` helper |
| `truememory/reranker.py` | `_load_failed` flag; `is_degraded()` + `mark_degraded(reason)` public API; early return in `rerank()`, `rerank_with_fusion()`, `rerank_with_modality_fusion()` when degraded |
| `truememory/telemetry.py` | `@tracked` is now async-aware: detects `iscoroutinefunction` and wraps with `async_wrapper` so FastMCP sees the correct function type (sketch below this table) |
| `truememory/engine.py` | `add()` pre-computes content + separation embeddings BEFORE `_write_lock` — concurrent stores overlap on inference (~10-50 ms per encode, GIL released inside `.encode()`) |
| `tests/test_concurrent_store_hang.py` | New file (3 tests): engine concurrency, MCP handler async shape, `asyncio.gather` end-to-end. Includes `truememory_status` in the async-handler list |
| `tests/test_cold_start_resilience.py` | New file (14 tests): WNOHANG guard, degraded-flag lifecycle, watchdog timeout + fast-load non-regression, `_set_reranker` short-circuit, health-payload wiring on both timeout and exception paths, timeout-parser validation (positive / zero / negative / non-integer / unset) |
| `tests/test_health_stats.py` | Wraps the now-async `truememory_stats()` in `asyncio.run()` |
| `tests/ingest/test_robustness_fixes.py` | 4 `@pytest.mark.skipif` decorators: `os.geteuid() == 0` → `not hasattr(os, "geteuid") or os.geteuid() == 0`. Same POSIX-only-API pattern as WNOHANG; this was crashing pytest collection on Windows |
| `CHANGELOG.md` | New `## [Unreleased]` section documenting all of the above |
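The `truememory/telemetry.py` row hinges on one pattern: branch on `iscoroutinefunction` before wrapping. A minimal sketch (the real `@tracked` presumably records timings and metadata where the placeholder comments sit):

```python
import asyncio
import functools


def tracked(fn):
    """Telemetry wrapper that preserves the wrapped function's sync/async type.

    Wrapping an `async def` in a plain sync wrapper returns an unawaited
    coroutine object, and FastMCP then sees a sync function, silently
    defeating the async-handler conversion.
    """
    if asyncio.iscoroutinefunction(fn):
        @functools.wraps(fn)
        async def async_wrapper(*args, **kwargs):
            # record call start / emit metrics here
            return await fn(*args, **kwargs)
        return async_wrapper

    @functools.wraps(fn)
    def sync_wrapper(*args, **kwargs):
        # record call start / emit metrics here
        return fn(*args, **kwargs)
    return sync_wrapper
```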

Test Plan

  • python -m pytest tests/test_cold_start_resilience.py -v → 14 passed
  • python -m pytest tests/test_concurrent_store_hang.py -v → 3 passed
  • Cold start: truememory-mcp reaches "server ready" within 30 s, no boot hang
  • Windows: no AttributeError: module 'os' has no attribute 'WNOHANG' in logs
  • TRUEMEMORY_LAZY_MODELS=1 — server starts instantly, models load lazily on first search
  • Corrupt HF cache (rm -rf ~/.cache/huggingface/hub/models--BAAI--bge-reranker-v2-m3*): confirm "preload exceeded 30s" appears in logs AND truememory_stats.health.reranker.status == "degraded" AND truememory_search returns results (sans rerank)
  • TRUEMEMORY_RERANKER_TIMEOUT_SEC=0: WARNING logged at startup, watchdog still uses 30 s default

Design Notes

  • Why 30 s default? Cold load of BAAI/bge-reranker-v2-m3 on a typical
    machine with warm HF cache: ~3-8 s. First-time download over slow network:
    10-25 s. 30 s gives generous headroom while still surfacing real hangs.
  • Why no "disable watchdog" option? TRUEMEMORY_LAZY_MODELS=1 already
    covers the "I don't want preload" path. Allowing TIMEOUT_SEC=0 to mean
    "disable" is a footgun — a typo like TIMEOUT_SEC= in a shell script
    becomes 0 and silently disables the safety net. (See the validation
    sketch after these notes.)
  • Why not kill the orphaned reranker thread? Python threading has no
    safe kill primitive. The orphan thread sits at 0% CPU using ~5 MB until
    process exit. Documented limitation; restart recovers cleanly.
  • Why not retry the reranker load? Once degraded, recovery requires
    process restart (clears any cache state too). Auto-retry hides root
    causes (HF cache corruption, ASR rule, network) that the user needs to
    fix once.
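Pulling these notes together, a sketch of the watchdog plus env-var validation and degraded-flag fallback. The helper names mirror the PR (`mark_degraded`, `is_degraded`, `_parse_reranker_timeout`); the `load_fn` parameter and `_start_preload_watchdog` name are illustrative, and the `_record_reranker_error` health wiring is omitted:

```python
import logging
import os
import threading

log = logging.getLogger(__name__)
_DEFAULT_TIMEOUT_SEC = 30

# --- degraded-flag surface (reranker.py) ---------------------------------
_load_failed = False


def is_degraded() -> bool:
    return _load_failed


def mark_degraded(reason: str) -> None:
    global _load_failed
    if not _load_failed:
        log.warning("reranker degraded: %s", reason)  # log once
    _load_failed = True


def rerank(query: str, candidates: list) -> list:
    if is_degraded():
        return candidates  # original ordering; never touches the stalled load
    ...  # real CrossEncoder scoring would go here


# --- timeout validation + watchdog (mcp_server.py) ------------------------
def _parse_reranker_timeout() -> int:
    raw = os.environ.get("TRUEMEMORY_RERANKER_TIMEOUT_SEC")
    if raw is None:
        return _DEFAULT_TIMEOUT_SEC
    try:
        value = int(raw)
    except ValueError:
        value = 0  # non-integer falls through to the warning below
    if value <= 0:
        # `TRUEMEMORY_RERANKER_TIMEOUT_SEC=` in a shell script must not
        # become thread.join(timeout=0), which would mark degraded on
        # every boot; warn and use the default instead.
        log.warning("invalid TRUEMEMORY_RERANKER_TIMEOUT_SEC=%r; using %ds",
                    raw, _DEFAULT_TIMEOUT_SEC)
        return _DEFAULT_TIMEOUT_SEC
    return value


def _start_preload_watchdog(load_fn) -> None:
    loader = threading.Thread(target=load_fn, daemon=True)
    loader.start()

    def _watchdog() -> None:
        loader.join(timeout=_parse_reranker_timeout())
        if loader.is_alive():
            # The loader thread is orphaned here (Python has no safe thread
            # kill), but nothing blocks on it once the flag is set.
            mark_degraded("reranker preload exceeded timeout")

    threading.Thread(target=_watchdog, daemon=True).start()
```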

Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>

Merge ordering

Status: Currently CONFLICTING against origin/main after PR #343 (fix/resource-efficiency-v1) merged. A trivial rebase is needed before merge — the branch was cut from 9d18019; main is now at 6ec0be5. I'll push the rebase the moment Hunter or a reviewer gives the go.

Blocks: #351 (Windows subprocess portability — mcp_server.py:1181 backlog drainer Popen layers on top of this PR's async-handler conversion).

Closes: #318 — Hunter's May-14 commit 5747b99 is the cherry-pick base of this PR's 68989d0. This PR is a strict superset (adds truememory_status async fix, Windows WNOHANG guard, reranker watchdog + degraded mode + _set_reranker short-circuit + health-payload wiring, regression tests, os.geteuid test fix, CHANGELOG).

Recommended merge sequence (10 PRs from a 3-agent coordination — see ~/.claude/handoffs/truememory-coordination-registry.md on Hunter's machine):

#353 (CI runner) → #344 (this) → #345 → #346 → #351 → #348 → #352 → #347 → #349 → #350

Huntehhh and others added 8 commits May 16, 2026 16:24
…ore hang

Resolves the 10-15s harness hang when 3+ truememory_store or search MCP
calls fire in parallel. Three layered changes:

1. mcp_server.py — 7 hot-path @mcp.tool() handlers (store / search /
   search_deep / get / forget / stats / entity_profile) changed from
   sync `def` to `async def`. Engine calls run via
   `await asyncio.to_thread(...)` so FastMCP's event-loop thread stays
   free for concurrent JSON-RPC requests. truememory_configure stays
   sync — heavy state mutation, called once at setup.

2. telemetry.py — `@tracked` is now async-aware. Wrapping an `async def`
   in the old sync wrapper produced an unawaited coroutine object that
   silently defeated the async-ification.

3. engine.py — `add()` pre-computes both content + separation embeddings
   OUTSIDE `_write_lock`. Previously the lock was held during the two
   ~10-50ms model.encode() calls, serializing all concurrent stores.
   PyTorch releases the GIL inside .encode(), so concurrent stores can
   now overlap on inference; they only contend at the INSERTs (μs).

Tests:
- tests/test_concurrent_store_hang.py (new): three regression locks —
  threaded engine.add(), MCP handler-shape check, asyncio.gather()
  end-to-end.
- tests/test_health_stats.py: wrap the now-async truememory_stats() in
  asyncio.run().

Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>
The May 14 async-handler commit (5747b99) didn't cover truememory_status
because it didn't exist yet at that base. status calls
RebuildManager.get_status which opens a SQLite connection — blocking I/O
that should run via asyncio.to_thread like the other async handlers.
Otherwise polling during a rebuild would block the event loop.

Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>
os.WNOHANG is POSIX-only; on Windows, os has no WNOHANG attribute and the
backlog drainer thread crashes with AttributeError on every boot for every
Windows user. Add a hasattr guard at the top of _reap_children so it
returns cleanly on platforms without the syscall.

Windows doesn't need zombie reaping anyway — terminated children release
their PID immediately, so there are no defunct processes to clean up.

Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>
A stalled reranker preload (corrupt HuggingFace cache, blocked download,
Windows Defender ASR denying a sentencepiece shim) currently leaves the
preload thread alive forever. Every subsequent search call then blocks
inside get_reranker() waiting on the same stalled load — which is why a
"frozen MCP" session always also has stats/search/forget hanging.

Three coordinated changes:

1. reranker.py — module-level _load_failed flag, is_degraded() reader,
   mark_degraded(reason) setter (logs once). rerank / rerank_with_fusion /
   rerank_with_modality_fusion check the flag at entry and return the
   original candidate ordering rather than calling get_reranker.

2. mcp_server.py _preload_models — _load_reranker now also calls
   mark_degraded on exception. A new watchdog thread joins on the loader
   for TRUEMEMORY_RERANKER_TIMEOUT_SEC (default 30s) and marks degraded
   if still alive. The loader thread is orphaned (Python can't safely
   kill threads) but no caller blocks on it.

3. mcp_server.py — new _RERANKER_LOAD_TIMEOUT_SEC constant near the
   existing _MODEL_IDLE_TIMEOUT_SEC for style consistency.

Result: cold start emits either a successful reranker load OR a single
WARNING line within the timeout window, and the server stays responsive
in degraded mode (non-reranked search). On next process restart the
degraded flag clears and load is retried.

Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>
- test_cold_start_resilience.py (new): pins down the 4 fixes from this PR.
  WNOHANG guard returns cleanly when os has no WNOHANG. is_degraded starts
  False, mark_degraded sets it. rerank / rerank_with_modality_fusion return
  original ordering when degraded (proves get_reranker is never called).
  Preload watchdog marks degraded on a stalled load and does NOT mark on a
  fast load (no false-positive regression).

- test_concurrent_store_hang.py: extend the async-handler regression list
  to include truememory_status, which the May 14 commit predated.

Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>
Four test decorators called os.geteuid() at module-import time to skip
when running as POSIX root. A sister bug to the WNOHANG one — geteuid is
POSIX-only, so on Windows the import itself raised AttributeError and
pytest could not even collect the file. Symptom: the full test suite
errored at collection on every Windows contributor's machine.

Wrap each decorator with a hasattr check so the test skips cleanly on
Windows (where chmod 555 / chmod 000 don't enforce read-only the same
way on NTFS as on POSIX anyway).

Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>
…out validation

Three follow-on fixes on top of the watchdog/degraded-flag commit so the
cold-start fix actually delivers in production conditions:

1. _set_reranker now short-circuits when reranker.is_degraded(). Without
   this, every truememory_search / truememory_search_deep call would still
   block — not on the event loop (the async-handler fix protects that) but
   on the reranker._lock that the stalled preload thread is holding. So
   the thread pool serializes instead, eating slots until exhausted.

2. The watchdog and the preload exception path now both call
   _record_reranker_error in addition to mark_degraded, so the F06 health
   payload (truememory_stats.health.reranker) reflects the degraded state
   immediately. Without this, operators would see status="ok" while search
   is silently falling back.

3. TRUEMEMORY_RERANKER_TIMEOUT_SEC is now parsed through
   _parse_reranker_timeout — non-positive and non-integer values fall back
   to the default with a warning. A typo like an empty assignment
   (TRUEMEMORY_RERANKER_TIMEOUT_SEC= in a shell script, which parses as 0)
   used to make thread.join(timeout=0) return immediately, marking degraded
   on every boot. The legitimate "skip preload" path is TRUEMEMORY_LAZY_MODELS=1.

Includes regression tests for all three paths.

Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>
Documents the seven commits of this PR series — async-handler conversion,
status async fix, WNOHANG guard, reranker watchdog + degraded fallback,
test portability, _set_reranker short-circuit, health-payload wiring, and
TIMEOUT_SEC validation — plus the new public surface
(TRUEMEMORY_RERANKER_TIMEOUT_SEC env var, reranker.is_degraded /
mark_degraded helpers, two regression test files).

Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>