Skip to content

fix(engine, tier_switch): silent-failure + concurrency hardening#347

Open
Huntehhh wants to merge 5 commits into
buildingjoshbetter:mainfrom
Huntehhh:fix/engine-tier-switch-hardening
Open

fix(engine, tier_switch): silent-failure + concurrency hardening#347
Huntehhh wants to merge 5 commits into
buildingjoshbetter:mainfrom
Huntehhh:fix/engine-tier-switch-hardening

Conversation

@Huntehhh
Copy link
Copy Markdown
Contributor

@Huntehhh Huntehhh commented May 16, 2026

Summary

Six robustness fixes surfaced by a 2026-05-16 cross-platform audit
(Gemini 3.1 Pro + Grok 4.3 + five sub-agents). All in code paths that
fail silently or under load conditions a single user wouldn't see in
testing.

  1. engine.delete_all silently fails for users with >999 messages
    IN (?, ?, ...) clauses unchunked, hitting
    SQLITE_MAX_VARIABLE_NUMBER default of 999; the existing
    except Exception: logger.warning caught the resulting
    OperationalError and silently leaked rows from fact_timeline,
    landmark_events, causal_edges, vec_messages*, episodes. Added
    _delete_in_chunks helper batching at 500.

  2. search_agentic skips LLM-rerank fallback when cross-encoder is
    degraded
    rerank_with_modality_fusion returns the original
    ordering without fused_score when degraded; the previous code
    returned that and bypassed the LLM-rerank block. Now detects via
    result-key inspection and falls through.

  3. sqlite-vec load failure invisible in health payload — DEBUG-
    only log meant truememory_stats.health.vectors read as "ok" while
    search was silently in FTS-only mode. Now writes to module-level
    _vectors_load_error and logs at WARNING.

  4. vector_search separation-vector failure logged at DEBUG
    silently dropped one memory from every future sender-aware search.
    Promoted to WARNING with explicit context.

  5. reranker.get_reranker fast-path TOCTOU_model and
    _model_name read as two separate globals; GIL release between them
    under concurrent set_active_tier could return the old model under
    the new name. Bundled into single tuple unpack.

  6. RebuildManager check-then-write race + conn leaks — two
    concurrent truememory_configure calls could both spawn rebuild
    threads racing on the same DB. Added _state_lock around every
    read / write of _active_thread and _active_worker. Wrapped
    start_rebuild and run_rebuild_sync in try / finally to fix
    conn-handle leaks on exception paths. Added encoding="utf-8" to
    the config.json read in _apply_config_switch (silent tier-switch
    loss on Windows cp1252 default with non-ASCII bytes).

This PR is intentionally scoped to engine, reranker, vector_search,
and tier_switch
— the parallel Windows-portability work
(fix/windows-asr-trampoline-bypass) covers
install.ps1 / install.sh / mcp_server.py / model_server.py /
model_client.py / hooks/core.py / ingest/hooks/* / ingest/cli.py
and is disjoint from this change — both can land in either order.

Changes

File Change
truememory/engine.py New _SQLITE_IN_CHUNK = 500 + _delete_in_chunks(conn, table, col, ids) helper; delete_all(user_id=...) re-uses helper at 5 call sites; search_agentic checks for fused_score / rerank_score to detect degraded reranker and fall through to LLM fallback; sqlite-vec load failure writes to _vectors_load_error and logs at WARNING
truememory/reranker.py get_reranker fast path bundles _model / _model_name into single tuple unpack (both fast-path AND double-checked lock branches)
truememory/vector_search.py build_separation_vector_single failure logs at WARNING with sender-aware-search context (was DEBUG)
truememory/tier_switch/manager.py New _state_lock (threading.Lock) in __init__; start_rebuild / run_rebuild_sync / _rebuild_thread / cancel all read/write _active_thread and _active_worker under the lock; start_rebuild and run_rebuild_sync wrapped in try / finally for conn cleanup; _apply_config_switch reads config.json with encoding="utf-8"
tests/test_engine_tier_switch_hardening.py New file, 10 regression tests covering all 6 fixes
CHANGELOG.md Unreleased section documenting every fix

Test Plan

  • python -m pytest tests/test_engine_tier_switch_hardening.py -v → 9 passed + 1 skipped (signature probe)
  • Spot-check: a delete_all invocation with >1000 messages no longer drops related-table cleanup
  • Spot-check: with reranker degraded (TRUEMEMORY_RERANKER_TIMEOUT_SEC=1, corrupt HF cache), search_agentic still applies LLM rerank
  • Spot-check: with sqlite-vec broken, truememory_stats.health.vectors.status shows degraded not ok
  • Two concurrent truememory_configure calls produce one rebuild log line, not two

Design Notes

  • Why 500 for _SQLITE_IN_CHUNK? Stays well under the conservative 999 floor; tuning above 500 buys little because the dominant cost is the per-batch WHERE-clause evaluation, not the round trip. Floor is also asserted in the test suite.
  • Why source-inspection for some tests? The get_reranker TOCTOU window and the search_agentic degraded fallthrough are both narrow enough that direct runtime tests would require either threading harnesses (flaky) or full engine fixtures (heavy). The structural checks (inspect.getsource + substring) catch the common accidental-revert case, which is the realistic threat model.
  • Why not a single shared helper across the 3 engine fixes? Each fix is in a different region of engine.py (lines ~340 / ~520 / ~1730) and the surrounding contexts differ enough that extracting a shared helper would create more reading overhead than it saves.

Co-Authored-By: claude-opus-4-7 wontreply@getfucked.ai

Merge ordering

Status: MERGEABLE (clean against origin/main; mergeable but UNSTABLE per GitHub — CI signal pending or partially red, will investigate if blocking).

Order-independent. Disjoint files from every other open PR:

Can ship anywhere in the queue without dependency reshuffling.

Full sequence (10 PRs from a 3-agent coordination):

#353 (CI runner) → #344 → #345 → #346 → #351 → #348 → #352 → #347 (this — float) → #349 → #350

Huntehhh and others added 4 commits May 16, 2026 18:04
Three issues surfaced in the 2026-05-16 robustness audit. All in engine.py.

1. `delete_all(user_id=...)` builds `DELETE FROM <t> WHERE <col> IN (?, ?, ...)`
   clauses from the full `msg_ids` and `episode_ids` lists. On SQLite builds
   with the default `SQLITE_MAX_VARIABLE_NUMBER=999` (still the value on
   Debian/Ubuntu system packages as of 2026), any user with >999 stored
   messages hits `OperationalError`, caught by the existing
   `except Exception: logger.warning(...)` — every related-table cleanup
   for that user silently failed, leaking `fact_timeline`,
   `landmark_events`, `causal_edges`, `vec_messages`, `vec_messages_sep`,
   and `episodes` rows. Added `_delete_in_chunks` helper that batches
   IN-clause deletes at 500 ids per round trip; applied at all five
   affected call sites.

2. `search_agentic` returned the modality-fusion result unchanged and
   skipped the LLM-rerank fallback block (lines ~1758) even when the
   cross-encoder was degraded. When degraded,
   `rerank_with_modality_fusion` returns the original ordering with no
   `fused_score` / `rerank_score` keys. Detect this by inspecting the
   first result's keys; on degraded-mode detection, log at debug and fall
   through to the LLM fallback block instead of returning the un-reranked
   results. This restores the higher-quality option the fallback exists
   for.

3. `_ensure_connection`'s `except Exception: logger.debug(...)` on
   sqlite-vec extension load failure swallowed the error without writing
   to the module-level `_vectors_load_error` tracker. The
   `truememory_stats.health.vectors` field that operators rely on read as
   "ok" while search was silently in FTS-only fallback mode. Now writes
   the error string to `_vectors_load_error` and logs at WARNING so the
   degradation surfaces in both the health payload and the log stream.

Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>
reranker.get_reranker fast path read `_model` and `_model_name` as two
separate global lookups. Under a concurrent `set_active_tier` (called
during tier-switch via mcp_server) the GIL could release between the
two reads, returning the previous model under the new name and silently
serving the wrong reranker for one search before the next call. Bundle
the reads into a single tuple unpack — `cached_model, cached_name =
_model, _model_name` is a single STORE_NAME/STORE_FAST bytecode op so
the GIL cannot release mid-read. Same change inside the lock for the
double-checked pattern.

vector_search.build_separation_vector_single's silent-loss path
(`except Exception: logger.debug(...)`) was the worst kind of failure:
the row is in `messages` and `vec_messages` but the separation row was
silently dropped, so every future sender-aware search will miss this
memory. Promoted to WARNING with explicit "sender-aware search will not
surface this row" context so operators see the per-memory loss instead
of needing to enable DEBUG-level logging.

Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>
Three issues in RebuildManager (truememory/tier_switch/manager.py):

1. start_rebuild did `if self._active_thread and is_alive()` (line ~100)
   then later assigned `self._active_thread = thread` (line ~136) with
   NO lock between the check and the write. Two concurrent
   truememory_configure calls could both observe is_alive()==False,
   both create RebuildWorker instances, and both spawn rebuild threads
   racing on the same DB — leaving the tier in an indeterminate state
   that requires manual recovery. Added `_state_lock`
   (threading.Lock) in `__init__`; wrapped every read / write of
   `_active_thread` and `_active_worker` across `start_rebuild`,
   `run_rebuild_sync`, `_rebuild_thread`, and `cancel`.

2. `start_rebuild` and `run_rebuild_sync` opened a SQLite connection
   then closed it on each early-return branch individually. Any
   exception from `tier_group()`, `preflight_ram_check()`,
   `resolve_rebuild_action()`, `get_messages_to_embed()`,
   `_create_status_row()`, or `backup_db()` skipped the per-branch
   close and leaked the handle. Wrapped both methods in a single
   `try / finally: conn.close()` covering all exit paths.

3. `_apply_config_switch` parses `config.json` via
   `config_path.read_text()` with NO `encoding=` arg. On Windows the
   default codec is cp1252 — a non-ASCII byte (a Cyrillic / accented
   character in an API key or a Cohere endpoint name) crashes the parse
   and silently drops the tier switch. Added `encoding="utf-8"` to
   match every other JSON-reading path in the codebase.

Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>
10 tests in tests/test_engine_tier_switch_hardening.py pin the four
bug classes addressed by this PR:

- `_delete_in_chunks` helper: empty / single / chunked / floor-below-999
  invariants. Catches a future bump of `_SQLITE_IN_CHUNK` above 999
  before it lands in users' hands.
- `reranker.get_reranker` fast-path: source-inspection check for the
  tuple-bundle pattern. Hard to test runtime atomicity directly because
  Python's GIL makes the bug rare; the structural check guards the fix
  against accidental revert.
- `RebuildManager`: `_state_lock` attribute exists, `cancel()` reads
  under lock, two concurrent `start_rebuild` callers result in at most
  one spawned thread.
- `search_agentic` degraded fallthrough: source-inspection check that
  the `_cross_encoder_ran` guard exists. End-to-end runtime test would
  require a full engine fixture; structural check covers the common
  refactor-revert case.
- `vector_search` separation-vector failure: log level is WARNING
  (skipped on builds where the private helper signature differs).

CHANGELOG.md gains an `Unreleased` section documenting every fix with
the "what broke / why it matters / what changed" pattern that matches
the existing 0.6.x entries.

Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>
Two F401 violations surfaced by upstream CI on PR buildingjoshbetter#347:
- `unittest.mock.patch` left over from an earlier reload-based test
  approach that was refactored to monkeypatch in commit 471a98a.
- `_delete_in_chunks` imported alongside `_SQLITE_IN_CHUNK` in the
  floor-test but never actually called there — only the constant
  matters for that assertion.

Tests still 9 passed + 1 skipped locally; ruff is now clean.

Co-Authored-By: claude-opus-4-7 <wontreply@getfucked.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant