fix(mcp): cold-start resilience — async handlers, reranker timeout, Windows portability#344
Open
Huntehhh wants to merge 8 commits into
Open
fix(mcp): cold-start resilience — async handlers, reranker timeout, Windows portability#344Huntehhh wants to merge 8 commits into
Huntehhh wants to merge 8 commits into
Conversation
…ore hang Resolves the 10-15s harness hang when 3+ truememory_store or search MCP calls fire in parallel. Three layered changes: 1. mcp_server.py — 7 hot-path @mcp.tool() handlers (store / search / search_deep / get / forget / stats / entity_profile) changed from sync `def` to `async def`. Engine calls run via `await asyncio.to_thread(...)` so FastMCP's event-loop thread stays free for concurrent JSON-RPC requests. truememory_configure stays sync — heavy state mutation, called once at setup. 2. telemetry.py — `@tracked` is now async-aware. Wrapping an `async def` in the old sync wrapper produced an unawaited coroutine object that silently defeated the async-ification. 3. engine.py — `add()` pre-computes both content + separation embeddings OUTSIDE `_write_lock`. Previously the lock was held during the two ~10-50ms model.encode() calls, serializing all concurrent stores. PyTorch releases the GIL inside .encode(), so concurrent stores can now overlap on inference; they only contend at the INSERTs (μs). Tests: - tests/test_concurrent_store_hang.py (new): three regression locks — threaded engine.add(), MCP handler-shape check, asyncio.gather() end-to-end. - tests/test_health_stats.py: wrap the now-async truememory_stats() in asyncio.run(). Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>
The May 14 async-handler commit (5747b99) didn't cover truememory_status because it didn't exist yet at that base. status calls RebuildManager.get_status which opens a SQLite connection — blocking I/O that should run via asyncio.to_thread like the other async handlers. Otherwise polling during a rebuild would block the event loop. Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>
os.WNOHANG is POSIX-only; on Windows, os has no WNOHANG attribute and the backlog drainer thread crashes with AttributeError on every boot for every Windows user. Add a hasattr guard at the top of _reap_children so it returns cleanly on platforms without the syscall. Windows doesn't need zombie reaping anyway — terminated children release their PID immediately, so there are no defunct processes to clean up. Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>
A stalled reranker preload (corrupt HuggingFace cache, blocked download, Windows Defender ASR denying a sentencepiece shim) currently leaves the preload thread alive forever. Every subsequent search call then blocks inside get_reranker() waiting on the same stalled load — which is why a "frozen MCP" session always also has stats/search/forget hanging. Three coordinated changes: 1. reranker.py — module-level _load_failed flag, is_degraded() reader, mark_degraded(reason) setter (logs once). rerank / rerank_with_fusion / rerank_with_modality_fusion check the flag at entry and return the original candidate ordering rather than calling get_reranker. 2. mcp_server.py _preload_models — _load_reranker now also calls mark_degraded on exception. A new watchdog thread joins on the loader for TRUEMEMORY_RERANKER_TIMEOUT_SEC (default 30s) and marks degraded if still alive. The loader thread is orphaned (Python can't safely kill threads) but no caller blocks on it. 3. mcp_server.py — new _RERANKER_LOAD_TIMEOUT_SEC constant near the existing _MODEL_IDLE_TIMEOUT_SEC for style consistency. Result: cold start emits either a successful reranker load OR a single WARNING line within the timeout window, and the server stays responsive in degraded mode (non-reranked search). On next process restart the degraded flag clears and load is retried. Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>
- test_cold_start_resilience.py (new): pins down the 4 fixes from this PR. WNOHANG guard returns cleanly when os has no WNOHANG. is_degraded starts False, mark_degraded sets it. rerank / rerank_with_modality_fusion return original ordering when degraded (proves get_reranker is never called). Preload watchdog marks degraded on a stalled load and does NOT mark on a fast load (no false-positive regression). - test_concurrent_store_hang.py: extend the async-handler regression list to include truememory_status, which the May 14 commit predated. Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>
Four test decorators called os.geteuid() at module-import time to skip when running as POSIX root. WNOHANG-sister bug — geteuid is POSIX-only, so on Windows the import itself raised AttributeError and pytest could not even collect the file. Symptom: full test suite errored at collection on every Windows contributor's machine. Wrap each decorator with a hasattr check so the test skips cleanly on Windows (where chmod 555 / chmod 000 don't enforce read-only the same way on NTFS as on POSIX anyway). Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>
…out validation Three follow-on fixes on top of the watchdog/degraded-flag commit so the cold-start fix actually delivers in production conditions: 1. _set_reranker now short-circuits when reranker.is_degraded(). Without this, every truememory_search / truememory_search_deep call would still block — not on the event loop (the async-handler fix protects that) but on the reranker._lock that the stalled preload thread is holding. So the thread pool serializes instead, eating slots until exhausted. 2. The watchdog and the preload exception path now both call _record_reranker_error in addition to mark_degraded, so the F06 health payload (truememory_stats.health.reranker) reflects the degraded state immediately. Without this, operators would see status="ok" while search is silently falling back. 3. TRUEMEMORY_RERANKER_TIMEOUT_SEC is now parsed through _parse_reranker_timeout — non-positive and non-integer values fall back to the default with a warning. A typo like an unset env var (TRUEMEMORY_RERANKER_TIMEOUT_SEC= in a shell script becomes 0) used to make thread.join(timeout=0) return immediately, marking degraded on every boot. The legitimate "skip preload" path is TRUEMEMORY_LAZY_MODELS=1. Includes regression tests for all three paths. Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>
Documents the seven commits of this PR series — async-handler conversion, status async fix, WNOHANG guard, reranker watchdog + degraded fallback, test portability, _set_reranker short-circuit, health-payload wiring, and TIMEOUT_SEC validation — plus the new public surface (TRUEMEMORY_RERANKER_TIMEOUT_SEC env var, reranker.is_degraded / mark_degraded helpers, two regression test files). Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>
This was referenced May 17, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Three compounding failure modes caused the MCP server to hang or crash on cold
start. Plus a sibling Windows-portability fix in
pytesttest decorators(same
hasattr-guard theme as the WNOHANG fix; was blockingpytestcollection on Windows entirely).
Sync handlers — all 9
@mcp.tool()handlers were syncdef, so oneblocked handler froze every concurrent JSON-RPC request on FastMCP's
single asyncio thread. Converted to
async defwith engine calls wrappedin
asyncio.to_thread.@trackedis now async-aware so it doesn'tre-wrap coroutines as sync functions.
Reranker preload hangs —
CrossEncoder(...)in_preload_modelsblocks indefinitely on corrupt HuggingFace cache, stalled download, or
Windows Defender ASR denying a sentencepiece shim. Added a 30 s watchdog
(
TRUEMEMORY_RERANKER_TIMEOUT_SECoverride); on timeout, marksdegraded and rerank entrypoints fall back to original-ordering results.
_set_rerankershort-circuits when degraded so search calls don'tblock on the stalled load's lock. Degraded state surfaces in the F06
health payload (
truememory_stats.health.reranker).os.WNOHANGis POSIX-only —_reap_childrencalledos.waitpid(-1, os.WNOHANG), crashing every Windows user's backlogdrainer with
AttributeErroron every boot. Guarded withhasattr.Also:
engine.add()now pre-computes embeddings outside_write_locksoconcurrent stores overlap on inference instead of serializing.
Changes
truememory/mcp_server.pyasync def+asyncio.to_thread; reranker watchdog withTRUEMEMORY_RERANKER_TIMEOUT_SEC(default 30 s, validated);_reap_childrenPOSIX guard;_set_rerankershort-circuit on degraded; watchdog wires_record_reranker_errorso health payload reflects truth;_parse_reranker_timeouthelpertruememory/reranker.py_load_failedflag;is_degraded()+mark_degraded(reason)public API; early return inrerank(),rerank_with_fusion(),rerank_with_modality_fusion()when degradedtruememory/telemetry.py@trackedasync-aware: detectsiscoroutinefunctionand wraps withasync_wrapperso FastMCP sees the correct function typetruememory/engine.pyadd()pre-computes content + separation embeddings BEFORE_write_lock— concurrent stores overlap on inference (~10-50 ms per encode, GIL released inside.encode())tests/test_concurrent_store_hang.pyasyncio.gatherend-to-end. Includestruememory_statusin the async-handler listtests/test_cold_start_resilience.py_set_rerankershort-circuit, health-payload wiring on both timeout and exception paths, timeout-parser validation (positive / zero / negative / non-integer / unset)tests/test_health_stats.pytruememory_stats()inasyncio.run()tests/ingest/test_robustness_fixes.py@pytest.mark.skipifdecorators:os.geteuid()→not hasattr(os, "geteuid") or os.geteuid() == 0. Same POSIX-only-API pattern as WNOHANG; was crashingpytestcollection on WindowsCHANGELOG.md## [Unreleased]section documenting all of the aboveTest Plan
python -m pytest tests/test_cold_start_resilience.py -v→ 14 passedpython -m pytest tests/test_concurrent_store_hang.py -v→ 3 passedtruememory-mcpreaches "server ready" within 30 s, no boot hangAttributeError: module 'os' has no attribute 'WNOHANG'in logsTRUEMEMORY_LAZY_MODELS=1— server starts instantly, models load lazily on first searchrm -rf ~/.cache/huggingface/hub/models--BAAI--bge-reranker-v2-m3*): confirm "preload exceeded 30s" appears in logs ANDtruememory_stats.health.reranker.status == "degraded"ANDtruememory_searchreturns results (sans rerank)TRUEMEMORY_RERANKER_TIMEOUT_SEC=0: WARNING logged at startup, watchdog still uses 30 s defaultDesign Notes
BAAI/bge-reranker-v2-m3on a typicalmachine with warm HF cache: ~3-8 s. First-time download over slow network:
10-25 s. 30 s gives generous headroom while still surfacing real hangs.
TRUEMEMORY_LAZY_MODELS=1alreadycovers the "I don't want preload" path. Allowing
TIMEOUT_SEC=0to mean"disable" is a footgun — a typo like
TIMEOUT_SEC=in a shell scriptbecomes
0and silently disables the safety net.threadinghas nosafe kill primitive. The orphan thread sits at 0 % CPU using ~5 MB until
process exit. Documented limitation; restart recovers cleanly.
process restart (clears any cache state too). Auto-retry hides root
causes (HF cache corruption, ASR rule, network) that the user needs to
fix once.
Co-Authored-By: claude-opus-4-7 wontreply@getfucked.ai
Merge ordering
Status: Currently
CONFLICTINGagainstorigin/mainafter PR #343 (fix/resource-efficiency-v1) merged. Trivial rebase needed before merge — branch was cut off9d18019; main is now at6ec0be5. I'll push the rebase the moment Hunter or a reviewer gives the go.Blocks: #351 (Windows subprocess portability —
mcp_server.py:1181backlog drainer Popen layers on top of this PR's async-handler conversion).Closes: #318 — Hunter's May-14 commit
5747b99is the cherry-pick base of this PR's68989d0. This PR is a strict superset (addstruememory_statusasync fix, WindowsWNOHANGguard, reranker watchdog + degraded mode +_set_rerankershort-circuit + health-payload wiring, regression tests,os.geteuidtest fix, CHANGELOG).Recommended merge sequence (10 PRs from a 3-agent coordination — see
~/.claude/handoffs/truememory-coordination-registry.mdon Hunter's machine):