fix(engine, tier_switch): silent-failure + concurrency hardening by Huntehhh · Pull Request #347 · buildingjoshbetter/TrueMemory

Huntehhh · 2026-05-16T23:06:07Z

Summary

Six robustness fixes surfaced by a 2026-05-16 cross-platform audit
(Gemini 3.1 Pro + Grok 4.3 + five sub-agents). All in code paths that
fail silently or under load conditions a single user wouldn't see in
testing.

engine.delete_all silently fails for users with >999 messages —
IN (?, ?, ...) clauses unchunked, hitting
SQLITE_MAX_VARIABLE_NUMBER default of 999; the existing
except Exception: logger.warning caught the resulting
OperationalError and silently leaked rows from fact_timeline,
landmark_events, causal_edges, vec_messages*, episodes. Added
_delete_in_chunks helper batching at 500.
search_agentic skips LLM-rerank fallback when cross-encoder is
degraded — rerank_with_modality_fusion returns the original
ordering without fused_score when degraded; the previous code
returned that and bypassed the LLM-rerank block. Now detects via
result-key inspection and falls through.
sqlite-vec load failure invisible in health payload — DEBUG-
only log meant truememory_stats.health.vectors read as "ok" while
search was silently in FTS-only mode. Now writes to module-level
_vectors_load_error and logs at WARNING.
vector_search separation-vector failure logged at DEBUG —
silently dropped one memory from every future sender-aware search.
Promoted to WARNING with explicit context.
reranker.get_reranker fast-path TOCTOU — _model and
_model_name read as two separate globals; GIL release between them
under concurrent set_active_tier could return the old model under
the new name. Bundled into single tuple unpack.
RebuildManager check-then-write race + conn leaks — two
concurrent truememory_configure calls could both spawn rebuild
threads racing on the same DB. Added _state_lock around every
read / write of _active_thread and _active_worker. Wrapped
start_rebuild and run_rebuild_sync in try / finally to fix
conn-handle leaks on exception paths. Added encoding="utf-8" to
the config.json read in _apply_config_switch (silent tier-switch
loss on Windows cp1252 default with non-ASCII bytes).

This PR is intentionally scoped to engine, reranker, vector_search,
and tier_switch — the parallel Windows-portability work
(fix/windows-asr-trampoline-bypass) covers
install.ps1 / install.sh / mcp_server.py / model_server.py /
model_client.py / hooks/core.py / ingest/hooks/* / ingest/cli.py
and is disjoint from this change — both can land in either order.

Changes

File	Change
`truememory/engine.py`	New `_SQLITE_IN_CHUNK = 500` + `_delete_in_chunks(conn, table, col, ids)` helper; `delete_all(user_id=...)` re-uses helper at 5 call sites; `search_agentic` checks for `fused_score` / `rerank_score` to detect degraded reranker and fall through to LLM fallback; sqlite-vec load failure writes to `_vectors_load_error` and logs at WARNING
`truememory/reranker.py`	`get_reranker` fast path bundles `_model` / `_model_name` into single tuple unpack (both fast-path AND double-checked lock branches)
`truememory/vector_search.py`	`build_separation_vector_single` failure logs at WARNING with sender-aware-search context (was DEBUG)
`truememory/tier_switch/manager.py`	New `_state_lock` (threading.Lock) in `__init__`; `start_rebuild` / `run_rebuild_sync` / `_rebuild_thread` / `cancel` all read/write `_active_thread` and `_active_worker` under the lock; `start_rebuild` and `run_rebuild_sync` wrapped in `try / finally` for conn cleanup; `_apply_config_switch` reads `config.json` with `encoding="utf-8"`
`tests/test_engine_tier_switch_hardening.py`	New file, 10 regression tests covering all 6 fixes
`CHANGELOG.md`	Unreleased section documenting every fix

Test Plan

python -m pytest tests/test_engine_tier_switch_hardening.py -v → 9 passed + 1 skipped (signature probe)
Spot-check: a delete_all invocation with >1000 messages no longer drops related-table cleanup
Spot-check: with reranker degraded (TRUEMEMORY_RERANKER_TIMEOUT_SEC=1, corrupt HF cache), search_agentic still applies LLM rerank
Spot-check: with sqlite-vec broken, truememory_stats.health.vectors.status shows degraded not ok
Two concurrent truememory_configure calls produce one rebuild log line, not two

Design Notes

Why 500 for _SQLITE_IN_CHUNK? Stays well under the conservative 999 floor; tuning above 500 buys little because the dominant cost is the per-batch WHERE-clause evaluation, not the round trip. Floor is also asserted in the test suite.
Why source-inspection for some tests? The get_reranker TOCTOU window and the search_agentic degraded fallthrough are both narrow enough that direct runtime tests would require either threading harnesses (flaky) or full engine fixtures (heavy). The structural checks (inspect.getsource + substring) catch the common accidental-revert case, which is the realistic threat model.
Why not a single shared helper across the 3 engine fixes? Each fix is in a different region of engine.py (lines ~340 / ~520 / ~1730) and the surrounding contexts differ enough that extracting a shared helper would create more reading overhead than it saves.

Co-Authored-By: claude-opus-4-7 wontreply@getfucked.ai

Merge ordering

Status: MERGEABLE (clean against origin/main; mergeable but UNSTABLE per GitHub — CI signal pending or partially red, will investigate if blocking).

Order-independent. Disjoint files from every other open PR:

engine.py touched here is in different regions from fix(mcp): cold-start resilience — async handlers, reranker timeout, Windows portability #344's add() lock-shrink (lines ~440 vs lines ~500/1730/340) — mechanical inter-PR merge.
reranker.py, vector_search.py, tier_switch/manager.py not touched by any other open PR.

Can ship anywhere in the queue without dependency reshuffling.

Full sequence (10 PRs from a 3-agent coordination):

#353 (CI runner) → #344 → #345 → #346 → #351 → #348 → #352 → #347 (this — float) → #349 → #350

Three issues surfaced in the 2026-05-16 robustness audit. All in engine.py. 1. `delete_all(user_id=...)` builds `DELETE FROM <t> WHERE <col> IN (?, ?, ...)` clauses from the full `msg_ids` and `episode_ids` lists. On SQLite builds with the default `SQLITE_MAX_VARIABLE_NUMBER=999` (still the value on Debian/Ubuntu system packages as of 2026), any user with >999 stored messages hits `OperationalError`, caught by the existing `except Exception: logger.warning(...)` — every related-table cleanup for that user silently failed, leaking `fact_timeline`, `landmark_events`, `causal_edges`, `vec_messages`, `vec_messages_sep`, and `episodes` rows. Added `_delete_in_chunks` helper that batches IN-clause deletes at 500 ids per round trip; applied at all five affected call sites. 2. `search_agentic` returned the modality-fusion result unchanged and skipped the LLM-rerank fallback block (lines ~1758) even when the cross-encoder was degraded. When degraded, `rerank_with_modality_fusion` returns the original ordering with no `fused_score` / `rerank_score` keys. Detect this by inspecting the first result's keys; on degraded-mode detection, log at debug and fall through to the LLM fallback block instead of returning the un-reranked results. This restores the higher-quality option the fallback exists for. 3. `_ensure_connection`'s `except Exception: logger.debug(...)` on sqlite-vec extension load failure swallowed the error without writing to the module-level `_vectors_load_error` tracker. The `truememory_stats.health.vectors` field that operators rely on read as "ok" while search was silently in FTS-only fallback mode. Now writes the error string to `_vectors_load_error` and logs at WARNING so the degradation surfaces in both the health payload and the log stream. Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>

reranker.get_reranker fast path read `_model` and `_model_name` as two separate global lookups. Under a concurrent `set_active_tier` (called during tier-switch via mcp_server) the GIL could release between the two reads, returning the previous model under the new name and silently serving the wrong reranker for one search before the next call. Bundle the reads into a single tuple unpack — `cached_model, cached_name = _model, _model_name` is a single STORE_NAME/STORE_FAST bytecode op so the GIL cannot release mid-read. Same change inside the lock for the double-checked pattern. vector_search.build_separation_vector_single's silent-loss path (`except Exception: logger.debug(...)`) was the worst kind of failure: the row is in `messages` and `vec_messages` but the separation row was silently dropped, so every future sender-aware search will miss this memory. Promoted to WARNING with explicit "sender-aware search will not surface this row" context so operators see the per-memory loss instead of needing to enable DEBUG-level logging. Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>

Three issues in RebuildManager (truememory/tier_switch/manager.py): 1. start_rebuild did `if self._active_thread and is_alive()` (line ~100) then later assigned `self._active_thread = thread` (line ~136) with NO lock between the check and the write. Two concurrent truememory_configure calls could both observe is_alive()==False, both create RebuildWorker instances, and both spawn rebuild threads racing on the same DB — leaving the tier in an indeterminate state that requires manual recovery. Added `_state_lock` (threading.Lock) in `__init__`; wrapped every read / write of `_active_thread` and `_active_worker` across `start_rebuild`, `run_rebuild_sync`, `_rebuild_thread`, and `cancel`. 2. `start_rebuild` and `run_rebuild_sync` opened a SQLite connection then closed it on each early-return branch individually. Any exception from `tier_group()`, `preflight_ram_check()`, `resolve_rebuild_action()`, `get_messages_to_embed()`, `_create_status_row()`, or `backup_db()` skipped the per-branch close and leaked the handle. Wrapped both methods in a single `try / finally: conn.close()` covering all exit paths. 3. `_apply_config_switch` parses `config.json` via `config_path.read_text()` with NO `encoding=` arg. On Windows the default codec is cp1252 — a non-ASCII byte (a Cyrillic / accented character in an API key or a Cohere endpoint name) crashes the parse and silently drops the tier switch. Added `encoding="utf-8"` to match every other JSON-reading path in the codebase. Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>

10 tests in tests/test_engine_tier_switch_hardening.py pin the four bug classes addressed by this PR: - `_delete_in_chunks` helper: empty / single / chunked / floor-below-999 invariants. Catches a future bump of `_SQLITE_IN_CHUNK` above 999 before it lands in users' hands. - `reranker.get_reranker` fast-path: source-inspection check for the tuple-bundle pattern. Hard to test runtime atomicity directly because Python's GIL makes the bug rare; the structural check guards the fix against accidental revert. - `RebuildManager`: `_state_lock` attribute exists, `cancel()` reads under lock, two concurrent `start_rebuild` callers result in at most one spawned thread. - `search_agentic` degraded fallthrough: source-inspection check that the `_cross_encoder_ran` guard exists. End-to-end runtime test would require a full engine fixture; structural check covers the common refactor-revert case. - `vector_search` separation-vector failure: log level is WARNING (skipped on builds where the private helper signature differs). CHANGELOG.md gains an `Unreleased` section documenting every fix with the "what broke / why it matters / what changed" pattern that matches the existing 0.6.x entries. Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>

Two F401 violations surfaced by upstream CI on PR buildingjoshbetter#347: - `unittest.mock.patch` left over from an earlier reload-based test approach that was refactored to monkeypatch in commit 471a98a. - `_delete_in_chunks` imported alongside `_SQLITE_IN_CHUNK` in the floor-test but never actually called there — only the constant matters for that assertion. Tests still 9 passed + 1 skipped locally; ruff is now clean. Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>

Huntehhh and others added 4 commits May 16, 2026 18:04

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(engine, tier_switch): silent-failure + concurrency hardening#347

fix(engine, tier_switch): silent-failure + concurrency hardening#347
Huntehhh wants to merge 5 commits into
buildingjoshbetter:mainfrom
Huntehhh:fix/engine-tier-switch-hardening

Huntehhh commented May 16, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

Huntehhh commented May 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Test Plan

Design Notes

Merge ordering

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Huntehhh commented May 16, 2026 •

edited

Loading