Harden warm-recall degraded fallback and cache consistency by vsumner · Pull Request #257 · spacedriveapp/spacebot

vsumner · 2026-02-28T01:07:15Z

Problem

Warm recall fallback had a few correctness/resiliency gaps under degraded search/store conditions:

- hybrid fallback could fail with the same store-layer failure mode it was meant to survive
- forgotten memories could leak from warm cache in failure paths
- concurrent warmup refresh/delete could race and reintroduce stale entries
- warmup status could still report warm while warm-recall refresh had failed

Approach

This PR hardens warm-recall behavior with runtime-coordinated cache state and explicit failure signaling.

1) Runtime warm-cache coordination

added runtime warm-cache state:
- warm_recall_memories
- warm_recall_refreshed_at_unix_ms
- warm_recall_cache_lock
- warm_recall_cache_epoch
- warm_recall_inflight_forget_counts (refcounted in-flight forget IDs)
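
To make the shape of this state concrete, here is a minimal sketch using only the standard library. The field names come from the list above; every type is an assumption (the diff quoted later in this thread calls `.load()` and awaits the lock, which suggests swap-style containers and an async mutex in the real code), and `WarmMemory`, `WarmRecallState`, `snapshot`, and `epoch` are hypothetical names for illustration.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex, RwLock};

/// Hypothetical stand-in for a cached memory record.
#[derive(Clone, Debug, PartialEq)]
pub struct WarmMemory {
    pub id: String,
    pub content: String,
}

/// Field names follow the PR description; the types are assumptions.
pub struct WarmRecallState {
    /// Snapshot of memories available for degraded fallback.
    pub warm_recall_memories: RwLock<Arc<Vec<WarmMemory>>>,
    /// Time of the last successful refresh, if there has been one.
    pub warm_recall_refreshed_at_unix_ms: Mutex<Option<i64>>,
    /// Coordination lock taken by readers and writers of the cache.
    pub warm_recall_cache_lock: Mutex<()>,
    /// Counter bumped on cache mutations so a refresh can detect races.
    pub warm_recall_cache_epoch: AtomicU64,
    /// Refcount per memory ID with a forget currently in progress.
    pub warm_recall_inflight_forget_counts: Mutex<HashMap<String, usize>>,
}

impl WarmRecallState {
    pub fn new() -> Self {
        Self {
            warm_recall_memories: RwLock::new(Arc::new(Vec::new())),
            warm_recall_refreshed_at_unix_ms: Mutex::new(None),
            warm_recall_cache_lock: Mutex::new(()),
            warm_recall_cache_epoch: AtomicU64::new(0),
            warm_recall_inflight_forget_counts: Mutex::new(HashMap::new()),
        }
    }

    /// Take a cheap snapshot of the cache while holding the coordination lock.
    pub fn snapshot(&self) -> Arc<Vec<WarmMemory>> {
        let _guard = self.warm_recall_cache_lock.lock().unwrap();
        let memories = self.warm_recall_memories.read().unwrap();
        Arc::clone(&*memories)
    }

    /// Read the current epoch value.
    pub fn epoch(&self) -> u64 {
        self.warm_recall_cache_epoch.load(Ordering::SeqCst)
    }
}
```

A freshly constructed state has an empty snapshot, no refresh timestamp, and epoch 0.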

2) Hybrid fallback behavior

memory_recall now:
- uses warm cache as degraded fallback when hybrid search errors
- reads warm cache under coordination lock
- excludes in-flight forget IDs from warm fallback scoring
- returns explicit degraded metadata (degraded_fallback_used, degraded_fallback_error)
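
The fallback flow above can be sketched roughly as follows. This is not the PR's implementation: `RecallOutput`, `recall_with_fallback`, the `(id, content)` tuple representation, and the substring match used in place of real scoring are all assumptions. Only the two `degraded_fallback_*` field names come from the description.

```rust
use std::collections::HashSet;

/// Hypothetical output shape; only the two degraded_* names come from the PR.
#[derive(Debug, Clone, PartialEq)]
pub struct RecallOutput {
    pub memory_ids: Vec<String>,
    pub degraded_fallback_used: bool,
    pub degraded_fallback_error: Option<String>,
}

/// `hybrid` stands in for the result of the normal hybrid search.
/// `warm_cache` holds (id, content) pairs already snapshotted under the lock.
pub fn recall_with_fallback(
    hybrid: Result<Vec<String>, String>,
    warm_cache: &[(String, String)],
    inflight_forget_ids: &HashSet<String>,
    query: &str,
    max_results: usize,
) -> RecallOutput {
    match hybrid {
        // Healthy path: pass results through with no degraded metadata.
        Ok(memory_ids) => RecallOutput {
            memory_ids,
            degraded_fallback_used: false,
            degraded_fallback_error: None,
        },
        // Degraded path: answer from the warm cache and say so explicitly.
        Err(error) => {
            let needle = query.to_lowercase();
            let memory_ids = warm_cache
                .iter()
                // Never surface a memory that is currently being forgotten.
                .filter(|(id, _)| !inflight_forget_ids.contains(id))
                // Placeholder relevance check (case-insensitive substring).
                .filter(|(_, content)| content.to_lowercase().contains(&needle))
                .take(max_results)
                .map(|(id, _)| id.clone())
                .collect();
            RecallOutput {
                memory_ids,
                degraded_fallback_used: true,
                degraded_fallback_error: Some(error),
            }
        }
    }
}
```

Returning the error text alongside the results keeps the tool's error-as-result behavior while letting the caller see that the answer came from the degraded path.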

3) Delete path consistency/cancellation safety

memory_delete now:
- evicts warm cache entries even when memory is already forgotten
- uses refcounted InflightForgetGuard to exclude IDs while forget is in progress
- runs forget+evict flow in a spawned task so caller cancellation cannot split DB state and cache state
- bumps warm-cache epoch on cache mutations
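
As an illustration of the refcounted guard idea, here is a minimal RAII sketch. The name `InflightForgetGuard` comes from the PR; the `InflightCounts` alias, the `acquire` constructor, and the `is_inflight` helper are hypothetical, and the real guard may be structured differently. The spawned-task part of the delete flow is not shown.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Assumed shape of the shared refcount map (memory ID to in-flight count).
pub type InflightCounts = Arc<Mutex<HashMap<String, usize>>>;

/// Marks a memory ID as "forget in progress" for as long as the guard lives.
pub struct InflightForgetGuard {
    counts: InflightCounts,
    memory_id: String,
}

impl InflightForgetGuard {
    /// Increment the refcount for `memory_id` and return a guard that undoes it.
    pub fn acquire(counts: &InflightCounts, memory_id: &str) -> Self {
        *counts
            .lock()
            .unwrap()
            .entry(memory_id.to_string())
            .or_insert(0) += 1;
        Self {
            counts: Arc::clone(counts),
            memory_id: memory_id.to_string(),
        }
    }
}

impl Drop for InflightForgetGuard {
    fn drop(&mut self) {
        // Recover the map even if another holder panicked; never panic in drop.
        let mut counts = self.counts.lock().unwrap_or_else(|e| e.into_inner());
        let remaining = match counts.get_mut(&self.memory_id) {
            Some(count) => {
                *count -= 1;
                *count
            }
            None => return,
        };
        // Remove the entry only when the last concurrent forget finishes.
        if remaining == 0 {
            counts.remove(&self.memory_id);
        }
    }
}

/// True while at least one forget for `memory_id` is still in progress.
pub fn is_inflight(counts: &InflightCounts, memory_id: &str) -> bool {
    counts.lock().unwrap().contains_key(memory_id)
}
```

Refcounting matters when two deletes target the same ID concurrently: the ID stays excluded from warm fallback until both guards have been dropped, rather than reappearing when the first one finishes.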

4) Warmup refresh race hardening

cortex warm-recall refresh now:
- runs search outside warm-cache lock
- applies results only if epoch is unchanged (skip apply on concurrent mutation)
- records refresh failures into warmup errors so status correctly degrades
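
The epoch check can be sketched as below, under simplifying assumptions: a synchronous `std::sync::Mutex` stands in for the async lock, entries are plain ID strings, and `WarmCache`, `evict`, and `refresh_with` are hypothetical names. The point is only the ordering: read the epoch, search without the lock, then apply only if the epoch is unchanged.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

/// Hypothetical cache: the mutex protects the entries, the epoch counts mutations.
pub struct WarmCache {
    entries: Mutex<Vec<String>>,
    epoch: AtomicU64,
}

impl WarmCache {
    pub fn new(entries: Vec<String>) -> Self {
        Self {
            entries: Mutex::new(entries),
            epoch: AtomicU64::new(0),
        }
    }

    /// Remove an entry and bump the epoch while holding the lock.
    pub fn evict(&self, id: &str) -> bool {
        let mut entries = self.entries.lock().unwrap();
        let before = entries.len();
        entries.retain(|entry| entry != id);
        self.epoch.fetch_add(1, Ordering::SeqCst);
        entries.len() != before
    }

    /// Run `search` outside the lock, then apply its result only if no
    /// mutation happened in the meantime. Ok(true) means applied, Ok(false)
    /// means skipped; a search error is returned so the caller can record it.
    pub fn refresh_with<F>(&self, search: F) -> Result<bool, String>
    where
        F: FnOnce() -> Result<Vec<String>, String>,
    {
        let epoch_before = self.epoch.load(Ordering::SeqCst);
        let fresh = search()?; // potentially slow; the lock is not held here
        let mut entries = self.entries.lock().unwrap();
        if self.epoch.load(Ordering::SeqCst) != epoch_before {
            return Ok(false); // a concurrent eviction won; drop stale results
        }
        *entries = fresh;
        Ok(true)
    }

    pub fn entries(&self) -> Vec<String> {
        self.entries.lock().unwrap().clone()
    }
}
```

If an eviction lands while the search is running, the refresh result is discarded, so the evicted entry cannot be reintroduced from a stale search result. The next refresh picks up the current state.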

5) Wiring

branch/cortex-chat tool servers now construct runtime-aware MemoryRecallTool / MemoryDeleteTool.
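
A rough sketch of what "runtime-aware" construction could look like is shown below. The `new` and `with_runtime` constructor names and the optional `runtime_config` field appear in the review walkthrough further down; the placeholder `MemorySearch` and `RuntimeConfig` types, the argument types, and the `warm_fallback_available` helper are assumptions for illustration.

```rust
use std::sync::Arc;

/// Placeholder types; the real ones live in the spacebot crate.
pub struct MemorySearch;
pub struct RuntimeConfig;

pub struct MemoryRecallTool {
    memory_search: Arc<MemorySearch>,
    /// None means the tool has no warm cache to fall back to.
    runtime_config: Option<Arc<RuntimeConfig>>,
}

impl MemoryRecallTool {
    /// Non-runtime-aware constructor: hybrid search only.
    pub fn new(memory_search: Arc<MemorySearch>) -> Self {
        Self {
            memory_search,
            runtime_config: None,
        }
    }

    /// Runtime-aware constructor: enables warm-cache degraded fallback.
    pub fn with_runtime(
        memory_search: Arc<MemorySearch>,
        runtime_config: Arc<RuntimeConfig>,
    ) -> Self {
        Self {
            memory_search,
            runtime_config: Some(runtime_config),
        }
    }

    /// Hypothetical helper: whether degraded fallback can be attempted.
    pub fn warm_fallback_available(&self) -> bool {
        self.runtime_config.is_some()
    }

    /// Accessor for the search handle (keeps the field in use in this sketch).
    pub fn memory_search(&self) -> &Arc<MemorySearch> {
        &self.memory_search
    }
}
```

Keeping the field optional means existing call sites that use `new` still compile, but they silently lose the fallback, which is why the wiring change matters.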

Why this matches architecture

- keeps recall fallback branch-scoped and in-process (no channel blocking)
- preserves tool error-as-result behavior while improving degraded-mode correctness
- does not change DB schema or migrations

Files changed

src/config.rs
src/tools/memory_recall.rs
src/tools/memory_delete.rs
src/agent/cortex.rs
src/tools.rs
src/main.rs
src/api/agents.rs

Verification

Targeted:

cargo test --lib tools::memory_recall
cargo test --lib tools::memory_delete
cargo test --lib agent::cortex

Repo gates:

just preflight
just gate-pr

All passed.

Review closure

Addressed external review findings around:

- DB dependency in warm fallback path
- stale forgotten-memory leakage from warm cache
- delete/refresh race windows
- cancellation and concurrent same-ID forget correctness

Final review rounds:

Rust correctness reviewer: PASS
Quality reviewer: PASS

coderabbitai · 2026-02-28T01:07:35Z

Walkthrough

Adds a runtime warm-recall memory cache and lifecycle: new RuntimeConfig fields, warm-recall refresh and eviction logic, warmup telemetry reporting warm_recall_count, and threads RuntimeConfig into memory tools and tool-server creation for runtime-aware memory operations.

Changes

Cohort / File(s)	Summary
Runtime config / cache state `src/config.rs`	Added five RuntimeConfig fields for warm-recall cache state and initialized them in `RuntimeConfig::new` (`warm_recall_memories`, `warm_recall_refreshed_at_unix_ms`, `warm_recall_cache_lock`, `warm_recall_inflight_forget_counts`, `warm_recall_cache_epoch`).
Agent warmup integration `src/agent/cortex.rs`	Added `refresh_warm_recall_memories` helper, integrated it into `run_warmup_once`, and included `warm_recall_count` in warmup success/failure telemetry and error handling.
Tool-server wiring / call sites `src/api/agents.rs`, `src/main.rs`, `src/tools.rs`	Threaded `runtime_config: Arc<RuntimeConfig>` into `create_cortex_chat_tool_server` and `create_branch_tool_server` signatures and call sites; updated tool server builders to pass runtime_config.
Memory recall (warm-cache + hybrid fallback) `src/tools/memory_recall.rs`	Added optional `runtime_config` field and `with_runtime` constructor; implemented warm-cache fetch/scoring helpers, hybrid fallback path using warm cache, and new output fields `degraded_fallback_used`/`degraded_fallback_error`.
Memory delete (eviction & inflight guards) `src/tools/memory_delete.rs`	Added optional `runtime_config` field and `with_runtime` constructor; implemented `evict_from_warm_cache`, `InflightForgetGuard`, helpers for removing cache entries, and telemetry `removed_from_warm_cache`; includes tests for eviction and inflight logic.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

fix(cortex): harden startup warmup and bulletin coordination #248: Modifies cortex warmup flow and guards around run_warmup_once, overlapping warmup telemetry and lifecycle changes.
feat: add IMAP email_search tool for branch read-back #246: Updates tool-server APIs to accept runtime_config and threads it through tool constructors—directly related to the new with_runtime variants.
Workers tab: full transcript viewer, live SSE streaming, introspection tool #192: Changes tool-server creation call-sites/signatures; touches the same client wiring where runtime_config is threaded.

Suggested reviewers

jamiepine

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Harden warm-recall degraded fallback and cache consistency' directly and accurately summarizes the main objective of the PR, which focuses on hardening warm-recall behavior and improving cache consistency under degraded conditions.
Description check	✅ Passed	The description is comprehensive and directly related to the changeset, detailing the problems addressed, approach taken, implementation across affected components, and verification steps performed.
Docstring Coverage	✅ Passed	Docstring coverage is 86.96% which is sufficient. The required threshold is 80.00%.


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/tools.rs`:
- Around line 468-473: Replace the non-runtime-aware memory recall
instantiation: change the call to MemoryRecallTool::new(memory_search.clone())
so it uses the runtime-aware constructor with the existing runtime_config; use
MemoryRecallTool::with_runtime(memory_search, runtime_config.clone()) (matching
how MemoryDeleteTool is created) so memory recall honors warm-cache degraded
fallback behavior. Ensure you pass the correct cloned/owned memory_search and a
cloned runtime_config to the with_runtime call.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2188746 and 87e7ea9.

📒 Files selected for processing (7)

src/agent/cortex.rs
src/api/agents.rs
src/config.rs
src/main.rs
src/tools.rs
src/tools/memory_delete.rs
src/tools/memory_recall.rs

src/tools.rs

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

src/tools/memory_recall.rs (1)

44-77: Consider releasing the cache lock before scoring to reduce contention.

The warm_recall_cache_lock is held while score_warm_memories iterates and scores all warm memories (lines 70-76). Since warm_memories is already an Arc snapshot and inflight_forget_ids is already collected, the scoring could run outside the lock:

♻️ Suggested refactor to minimize lock duration

     async fn warm_cache_results(
         &self,
         query: &str,
         memory_type: Option<MemoryType>,
         max_results: usize,
     ) -> Vec<MemorySearchResult> {
         let Some(runtime_config) = self.runtime_config.as_ref() else {
             return Vec::new();
         };
-        let _warm_recall_cache_guard = runtime_config.warm_recall_cache_lock.lock().await;
 
-        let now_unix_ms = chrono::Utc::now().timestamp_millis();
-        let refreshed_at_unix_ms = *runtime_config
-            .warm_recall_refreshed_at_unix_ms
-            .load()
-            .as_ref();
-        let Some(age_secs) = warm_cache_age_secs(refreshed_at_unix_ms, now_unix_ms) else {
-            return Vec::new();
-        };
-        let warmup_refresh_secs = runtime_config.warmup.load().as_ref().refresh_secs.max(1);
-        if age_secs > warmup_refresh_secs {
-            return Vec::new();
-        }
+        let (warm_memories, inflight_forget_ids) = {
+            let _guard = runtime_config.warm_recall_cache_lock.lock().await;
 
-        let warm_memories = runtime_config.warm_recall_memories.load();
-        let inflight_forget_ids = snapshot_inflight_forget_ids(runtime_config);
+            let now_unix_ms = chrono::Utc::now().timestamp_millis();
+            let refreshed_at_unix_ms = *runtime_config
+                .warm_recall_refreshed_at_unix_ms
+                .load()
+                .as_ref();
+            let Some(age_secs) = warm_cache_age_secs(refreshed_at_unix_ms, now_unix_ms) else {
+                return Vec::new();
+            };
+            let warmup_refresh_secs = runtime_config.warmup.load().as_ref().refresh_secs.max(1);
+            if age_secs > warmup_refresh_secs {
+                return Vec::new();
+            }
+
+            (
+                runtime_config.warm_recall_memories.load(),
+                snapshot_inflight_forget_ids(runtime_config),
+            )
+        };
+
         score_warm_memories(
             query,
             warm_memories.as_ref(),
             memory_type,
             max_results,
             &inflight_forget_ids,
         )
     }

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@src/tools/memory_recall.rs` around lines 44 - 77, The cache lock in
warm_cache_results is held while calling score_warm_memories, causing
unnecessary contention; instead, capture the needed snapshot under the lock
(runtime_config.warm_recall_memories.load() into warm_memories and compute
inflight_forget_ids via snapshot_inflight_forget_ids(runtime_config)), then drop
the lock before calling score_warm_memories(query, warm_memories.as_ref(),
memory_type, max_results, &inflight_forget_ids). Update warm_cache_results to
acquire the warm_recall_cache_lock only for snapshotting and release it prior to
scoring so scoring runs on the copied warm_memories and inflight_forget_ids
without holding runtime_config.warm_recall_cache_lock.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/tools/memory_delete.rs`:
- Around line 250-261: When store.forget() returns false (a concurrent-forget
race) treat it as already-forgotten rather than a hard failure: still acquire or
respect the warm_recall_cache_lock and call
MemoryDeleteTool::evict_from_warm_cache(&runtime_config, &memory_id) and bump
runtime_config.warm_recall_cache_epoch.fetch_add(1, Ordering::SeqCst) so
degraded/fallback entries are evicted; ensure inflight_forget_guard is dropped
where appropriate (or dropped after evicting) to avoid deadlocks. Locate the
logic around was_forgotten, inflight_forget_guard, runtime_config and
MemoryDeleteTool::evict_from_warm_cache (also mirror the same change in the
similar block at lines ~301-305) and change the false-path to perform eviction
and epoch increment instead of treating it as a hard error.


ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 11de638 and 820e2ee.

📒 Files selected for processing (7)

src/agent/cortex.rs
src/api/agents.rs
src/config.rs
src/main.rs
src/tools.rs
src/tools/memory_delete.rs
src/tools/memory_recall.rs

🚧 Files skipped from review as they are similar to previous changes (1)

src/api/agents.rs

coderabbitai · 2026-02-28T02:38:49Z

src/tools/memory_delete.rs

+                    let mut removed_from_warm_cache = false;
+                    if was_forgotten {
+                        let _warm_recall_cache_guard =
+                            runtime_config.warm_recall_cache_lock.lock().await;
+                        removed_from_warm_cache =
+                            MemoryDeleteTool::evict_from_warm_cache(&runtime_config, &memory_id);
+                        runtime_config
+                            .warm_recall_cache_epoch
+                            .fetch_add(1, Ordering::SeqCst);
+                    } else {
+                        drop(inflight_forget_guard);
+                    }


⚠️ Potential issue | 🟡 Minor

Handle concurrent-forget races as already-forgotten and still evict warm cache.

If store.forget() returns false after the earlier successful load, the likely cause is a concurrent-forget race. The current path reports a hard failure and skips warm-cache eviction, which can leave stale degraded-fallback entries in the cache on racy paths.

🔧 Proposed adjustment

                     let mut removed_from_warm_cache = false;
                     if was_forgotten {
                         let _warm_recall_cache_guard =
                             runtime_config.warm_recall_cache_lock.lock().await;
                         removed_from_warm_cache =
                             MemoryDeleteTool::evict_from_warm_cache(&runtime_config, &memory_id);
                         runtime_config
                             .warm_recall_cache_epoch
                             .fetch_add(1, Ordering::SeqCst);
                     } else {
+                        let _warm_recall_cache_guard =
+                            runtime_config.warm_recall_cache_lock.lock().await;
+                        removed_from_warm_cache =
+                            MemoryDeleteTool::evict_from_warm_cache(&runtime_config, &memory_id);
+                        if removed_from_warm_cache {
+                            runtime_config
+                                .warm_recall_cache_epoch
+                                .fetch_add(1, Ordering::SeqCst);
+                        }
                         drop(inflight_forget_guard);
                     }
 @@
-                } else {
+                } else {
                     Ok(MemoryDeleteOutput {
                         forgotten: false,
-                        message: format!("Failed to forget memory {}.", args.memory_id),
+                        message: format!(
+                            "Memory {} was not newly forgotten (likely already forgotten). Warm cache evicted: {}.",
+                            args.memory_id, removed_from_warm_cache
+                        ),
                     })
                 }

Also applies to: 301-305


vsumner changed the title ~~Harden warm recall fallback and memory delete cache consistency~~ Harden warm-recall degraded fallback and cache consistency Feb 28, 2026

vsumner mentioned this pull request Feb 28, 2026

Worker stalls and stale completed status cause repeated irrelevant replies #258

Open

coderabbitai bot reviewed Feb 28, 2026


vsumner added 2 commits February 27, 2026 21:31

Harden warm recall fallback and delete cache consistency

5155848

Use runtime-aware recall in cortex chat tools

820e2ee

vsumner force-pushed the fix/warm-recall-fallback-hardening branch from 11de638 to 820e2ee Compare February 28, 2026 02:32

coderabbitai bot reviewed Feb 28, 2026

vsumner wants to merge 2 commits into spacedriveapp:main from vsumner:fix/warm-recall-fallback-hardening
