Harden warm-recall degraded fallback and cache consistency #257

Open
vsumner wants to merge 2 commits into spacedriveapp:main from vsumner:fix/warm-recall-fallback-hardening

@vsumner vsumner commented Feb 28, 2026

Problem

Warm recall fallback had a few correctness/resiliency gaps under degraded search/store conditions:

  • hybrid fallback could fail with the same store-layer failure mode it was meant to survive
  • forgotten memories could leak from warm cache in failure paths
  • concurrent warmup refresh/delete could race and reintroduce stale entries
  • warmup status could still report warm while warm-recall refresh had failed

Approach

This PR hardens warm-recall behavior with runtime-coordinated cache state and explicit failure signaling.

1) Runtime warm-cache coordination

  • added runtime warm-cache state:
    • warm_recall_memories
    • warm_recall_refreshed_at_unix_ms
    • warm_recall_cache_lock
    • warm_recall_cache_epoch
    • warm_recall_inflight_forget_counts (refcounted in-flight forget IDs)
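
As a rough illustration of how these five fields might sit together, here is a minimal sketch using only the standard library. The field names come from the list above; the field types, the `WarmMemory` record, and the `WarmRecallState` wrapper with its helper methods are assumptions for illustration, not the PR's actual `RuntimeConfig` definition (the review snippets suggest the real code uses an async lock and `.load()`-style snapshots).

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicI64, AtomicU64, Ordering};
use std::sync::{Arc, Mutex, RwLock};

/// Hypothetical stand-in for a cached memory record.
#[derive(Clone, Debug, PartialEq)]
pub struct WarmMemory {
    pub id: String,
    pub content: String,
}

/// Illustrative container: field names follow the PR, types are assumed.
pub struct WarmRecallState {
    /// Immutable snapshot of warm memories, swapped as a whole.
    pub warm_recall_memories: RwLock<Arc<Vec<WarmMemory>>>,
    /// Time of the last successful refresh; 0 means "never refreshed".
    pub warm_recall_refreshed_at_unix_ms: AtomicI64,
    /// Coordination lock serializing cache mutations and fallback reads.
    pub warm_recall_cache_lock: Mutex<()>,
    /// Counter bumped on cache mutations so stale refreshes can be detected.
    pub warm_recall_cache_epoch: AtomicU64,
    /// Refcount per memory ID with a forget currently in progress.
    pub warm_recall_inflight_forget_counts: Mutex<HashMap<String, usize>>,
}

impl WarmRecallState {
    pub fn new() -> Self {
        Self {
            warm_recall_memories: RwLock::new(Arc::new(Vec::new())),
            warm_recall_refreshed_at_unix_ms: AtomicI64::new(0),
            warm_recall_cache_lock: Mutex::new(()),
            warm_recall_cache_epoch: AtomicU64::new(0),
            warm_recall_inflight_forget_counts: Mutex::new(HashMap::new()),
        }
    }

    /// Cheap clone of the current snapshot (an `Arc` bump, no deep copy).
    pub fn snapshot(&self) -> Arc<Vec<WarmMemory>> {
        self.warm_recall_memories.read().unwrap().clone()
    }

    /// Current epoch value.
    pub fn epoch(&self) -> u64 {
        self.warm_recall_cache_epoch.load(Ordering::SeqCst)
    }

    /// Store a fresh snapshot and its timestamp under the coordination lock.
    pub fn store_snapshot(&self, memories: Vec<WarmMemory>, now_unix_ms: i64) {
        let _guard = self.warm_recall_cache_lock.lock().unwrap();
        *self.warm_recall_memories.write().unwrap() = Arc::new(memories);
        self.warm_recall_refreshed_at_unix_ms
            .store(now_unix_ms, Ordering::SeqCst);
    }
}
```

Keeping the snapshot behind an `Arc` means a reader can copy the pointer under the lock and then work on the data without holding it.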

2) Hybrid fallback behavior

  • memory_recall now:
    • uses warm cache as degraded fallback when hybrid search errors
    • reads warm cache under coordination lock
    • excludes in-flight forget IDs from warm fallback scoring
    • returns explicit degraded metadata (degraded_fallback_used, degraded_fallback_error)
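
The fallback flow above can be sketched roughly as follows. Only the output field names `degraded_fallback_used` and `degraded_fallback_error` come from the PR; the `recall_with_fallback` function, the `RecallOutput` and `WarmMemory` types, and the term-counting `score_terms` helper are hypothetical simplifications, and the real tool reads the cache under the coordination lock rather than taking a slice.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a cached memory record.
#[derive(Clone, Debug, PartialEq)]
pub struct WarmMemory {
    pub id: String,
    pub content: String,
}

/// Illustrative output carrying the explicit degraded-mode metadata.
#[derive(Debug, PartialEq)]
pub struct RecallOutput {
    pub memory_ids: Vec<String>,
    pub degraded_fallback_used: bool,
    pub degraded_fallback_error: Option<String>,
}

/// Naive score (assumption): number of query terms found in the content.
fn score_terms(query: &str, content: &str) -> usize {
    let content = content.to_lowercase();
    query
        .to_lowercase()
        .split_whitespace()
        .filter(|term| content.contains(*term))
        .count()
}

/// Return hybrid results when available; otherwise score the warm cache,
/// skipping any ID that has a forget in flight.
pub fn recall_with_fallback(
    query: &str,
    hybrid_result: Result<Vec<String>, String>,
    warm_cache: &[WarmMemory],
    inflight_forget_ids: &HashSet<String>,
    max_results: usize,
) -> RecallOutput {
    match hybrid_result {
        Ok(memory_ids) => RecallOutput {
            memory_ids,
            degraded_fallback_used: false,
            degraded_fallback_error: None,
        },
        Err(error) => {
            let mut scored: Vec<(usize, &WarmMemory)> = warm_cache
                .iter()
                .filter(|memory| !inflight_forget_ids.contains(&memory.id))
                .map(|memory| (score_terms(query, &memory.content), memory))
                .filter(|(points, _)| *points > 0)
                .collect();
            // Highest score first; the sort is stable, so ties keep cache order.
            scored.sort_by(|a, b| b.0.cmp(&a.0));
            RecallOutput {
                memory_ids: scored
                    .into_iter()
                    .take(max_results)
                    .map(|(_, memory)| memory.id.clone())
                    .collect(),
                degraded_fallback_used: true,
                degraded_fallback_error: Some(error),
            }
        }
    }
}
```

Returning the original search error alongside the fallback results lets the caller see that the answer is degraded instead of silently treating it as a normal hit.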

3) Delete path consistency/cancellation safety

  • memory_delete now:
    • evicts warm cache entries even when memory is already forgotten
    • uses refcounted InflightForgetGuard to exclude IDs while forget is in progress
    • runs forget+evict flow in a spawned task so caller cancellation cannot split DB state and cache state
    • bumps warm-cache epoch on cache mutations
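
A refcounted in-flight guard of the kind described above might look like the sketch below. The name `InflightForgetGuard` appears in the PR, but the `acquire` constructor, the `InflightCounts` alias, and the `is_inflight` helper are assumptions made for illustration; the PR's actual guard API is not shown in this conversation.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Assumed shape of the shared refcount map (memory ID -> active forgets).
pub type InflightCounts = Arc<Mutex<HashMap<String, usize>>>;

/// RAII guard: the ID counts as in flight for as long as a guard is alive.
pub struct InflightForgetGuard {
    counts: InflightCounts,
    memory_id: String,
}

impl InflightForgetGuard {
    /// Increment the refcount for `memory_id` and return a guard that
    /// decrements it again when dropped.
    pub fn acquire(counts: &InflightCounts, memory_id: &str) -> Self {
        *counts
            .lock()
            .unwrap()
            .entry(memory_id.to_string())
            .or_insert(0) += 1;
        Self {
            counts: Arc::clone(counts),
            memory_id: memory_id.to_string(),
        }
    }
}

impl Drop for InflightForgetGuard {
    fn drop(&mut self) {
        // A poisoned lock is ignored here; a real implementation may differ.
        let Ok(mut counts) = self.counts.lock() else {
            return;
        };
        if let Some(count) = counts.get_mut(&self.memory_id) {
            *count -= 1;
            if *count == 0 {
                // Last concurrent forget finished: stop excluding the ID.
                counts.remove(&self.memory_id);
            }
        }
    }
}

/// True while at least one forget for `memory_id` is still running.
pub fn is_inflight(counts: &InflightCounts, memory_id: &str) -> bool {
    counts.lock().unwrap().contains_key(memory_id)
}
```

The refcount matters when two deletes target the same ID concurrently: a plain set would stop excluding the ID as soon as the first delete finished, while the second was still in progress.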

4) Warmup refresh race hardening

  • cortex warm-recall refresh now:
    • runs search outside warm-cache lock
    • applies results only if epoch is unchanged (skip apply on concurrent mutation)
    • records refresh failures into warmup errors so status correctly degrades
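
The epoch check can be sketched as an optimistic apply: read the epoch, run the search without the lock, then apply only if no mutation happened in between. The `WarmCache` type, its methods, and the use of plain ID strings are hypothetical simplifications; the sketch only assumes, as in the quoted delete snippet, that evictions bump the epoch while holding the cache lock.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

/// Illustrative cache of memory IDs with a mutation epoch.
pub struct WarmCache {
    entries: Mutex<Vec<String>>,
    epoch: AtomicU64,
}

impl WarmCache {
    pub fn new(entries: Vec<String>) -> Self {
        Self {
            entries: Mutex::new(entries),
            epoch: AtomicU64::new(0),
        }
    }

    /// Copy of the current entries.
    pub fn snapshot(&self) -> Vec<String> {
        self.entries.lock().unwrap().clone()
    }

    /// Remove an ID and bump the epoch while holding the lock.
    pub fn evict(&self, memory_id: &str) -> bool {
        let mut entries = self.entries.lock().unwrap();
        let before = entries.len();
        entries.retain(|id| id.as_str() != memory_id);
        self.epoch.fetch_add(1, Ordering::SeqCst);
        entries.len() != before
    }

    /// Run `search` outside the lock and apply its result only if the epoch
    /// did not change meanwhile. Returns Ok(true) when applied, Ok(false)
    /// when skipped, and Err when the search itself failed, so the caller
    /// can record the failure.
    pub fn refresh_with<F>(&self, search: F) -> Result<bool, String>
    where
        F: FnOnce() -> Result<Vec<String>, String>,
    {
        let epoch_before = self.epoch.load(Ordering::SeqCst);
        let fresh = search()?; // potentially slow; no lock is held here
        let mut entries = self.entries.lock().unwrap();
        if self.epoch.load(Ordering::SeqCst) != epoch_before {
            // A concurrent delete mutated the cache: applying `fresh` could
            // reintroduce an entry that was just evicted.
            return Ok(false);
        }
        *entries = fresh;
        Ok(true)
    }
}
```

Skipping the apply costs one refresh cycle of freshness, which seems a reasonable trade against resurrecting a forgotten memory.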

5) Wiring

  • branch/cortex-chat tool servers now construct runtime-aware MemoryRecallTool / MemoryDeleteTool.

Why this matches architecture

  • keeps recall fallback branch-scoped and in-process (no channel blocking)
  • preserves tool error-as-result behavior while improving degraded-mode correctness
  • does not change DB schema or migrations

Files changed

  • src/config.rs
  • src/tools/memory_recall.rs
  • src/tools/memory_delete.rs
  • src/agent/cortex.rs
  • src/tools.rs
  • src/main.rs
  • src/api/agents.rs

Verification

Targeted:

  • cargo test --lib tools::memory_recall
  • cargo test --lib tools::memory_delete
  • cargo test --lib agent::cortex

Repo gates:

  • just preflight
  • just gate-pr

All passed.

Review closure

Addressed external review findings around:

  • DB dependency in warm fallback path
  • stale forgotten-memory leakage from warm cache
  • delete/refresh race windows
  • cancellation and concurrent same-ID forget correctness

Final review rounds:

  • Rust correctness reviewer: PASS
  • Quality reviewer: PASS

coderabbitai bot commented Feb 28, 2026

Walkthrough

Adds a runtime warm-recall memory cache and lifecycle: new RuntimeConfig fields, warm-recall refresh and eviction logic, warmup telemetry reporting warm_recall_count, and threads RuntimeConfig into memory tools and tool-server creation for runtime-aware memory operations.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Runtime config / cache state (src/config.rs) | Added five RuntimeConfig fields for warm-recall cache state and initialized them in RuntimeConfig::new (warm_recall_memories, warm_recall_refreshed_at_unix_ms, warm_recall_cache_lock, warm_recall_inflight_forget_counts, warm_recall_cache_epoch). |
| Agent warmup integration (src/agent/cortex.rs) | Added refresh_warm_recall_memories helper, integrated it into run_warmup_once, and included warm_recall_count in warmup success/failure telemetry and error handling. |
| Tool-server wiring / call sites (src/api/agents.rs, src/main.rs, src/tools.rs) | Threaded runtime_config: Arc<RuntimeConfig> into create_cortex_chat_tool_server and create_branch_tool_server signatures and call sites; updated tool server builders to pass runtime_config. |
| Memory recall, warm-cache + hybrid fallback (src/tools/memory_recall.rs) | Added optional runtime_config field and with_runtime constructor; implemented warm-cache fetch/scoring helpers, hybrid fallback path using warm cache, and new output fields degraded_fallback_used/degraded_fallback_error. |
| Memory delete, eviction & inflight guards (src/tools/memory_delete.rs) | Added optional runtime_config field and with_runtime constructor; implemented evict_from_warm_cache, InflightForgetGuard, helpers for removing cache entries, and telemetry removed_from_warm_cache; includes tests for eviction and inflight logic. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

Suggested reviewers

  • jamiepine

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'Harden warm-recall degraded fallback and cache consistency' directly and accurately summarizes the main objective of the PR, which focuses on hardening warm-recall behavior and improving cache consistency under degraded conditions. |
| Description check | ✅ Passed | The description is comprehensive and directly related to the changeset, detailing the problems addressed, approach taken, implementation across affected components, and verification steps performed. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 86.96%, which is sufficient. The required threshold is 80.00%. |

@vsumner vsumner changed the title from "Harden warm recall fallback and memory delete cache consistency" to "Harden warm-recall degraded fallback and cache consistency" Feb 28, 2026

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/tools.rs`:
- Around line 468-473: Replace the non-runtime-aware memory recall
instantiation: change the call to MemoryRecallTool::new(memory_search.clone())
so it uses the runtime-aware constructor with the existing runtime_config; use
MemoryRecallTool::with_runtime(memory_search, runtime_config.clone()) (matching
how MemoryDeleteTool is created) so memory recall honors warm-cache degraded
fallback behavior. Ensure you pass the correct cloned/owned memory_search and a
cloned runtime_config to the with_runtime call.
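
To make the difference between the two constructors concrete, here is a small sketch. The names `MemoryRecallTool::new`, `MemoryRecallTool::with_runtime`, and the optional `runtime_config` field come from this conversation; the empty `MemorySearch` and `RuntimeConfig` stand-ins, the `Arc` parameter types, and the `can_use_warm_fallback` helper are assumptions added for illustration.

```rust
use std::sync::Arc;

/// Hypothetical stand-ins for the real search handle and runtime config.
pub struct MemorySearch;
pub struct RuntimeConfig;

pub struct MemoryRecallTool {
    #[allow(dead_code)]
    memory_search: Arc<MemorySearch>,
    /// `None` when built with `new`; the warm fallback then has no cache.
    runtime_config: Option<Arc<RuntimeConfig>>,
}

impl MemoryRecallTool {
    /// Non-runtime-aware constructor: no access to warm-cache state.
    pub fn new(memory_search: Arc<MemorySearch>) -> Self {
        Self {
            memory_search,
            runtime_config: None,
        }
    }

    /// Runtime-aware constructor: carries the shared runtime config.
    pub fn with_runtime(
        memory_search: Arc<MemorySearch>,
        runtime_config: Arc<RuntimeConfig>,
    ) -> Self {
        Self {
            memory_search,
            runtime_config: Some(runtime_config),
        }
    }

    /// Illustrative helper: the warm fallback is only possible when a
    /// runtime config is present.
    pub fn can_use_warm_fallback(&self) -> bool {
        self.runtime_config.is_some()
    }
}
```

This matches the early return visible in the suggested `warm_cache_results` refactor, where a missing `runtime_config` yields an empty result set.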

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2188746 and 87e7ea9.

📒 Files selected for processing (7)
  • src/agent/cortex.rs
  • src/api/agents.rs
  • src/config.rs
  • src/main.rs
  • src/tools.rs
  • src/tools/memory_delete.rs
  • src/tools/memory_recall.rs

@vsumner vsumner force-pushed the fix/warm-recall-fallback-hardening branch from 11de638 to 820e2ee February 28, 2026 02:32

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/tools/memory_recall.rs (1)

44-77: Consider releasing the cache lock before scoring to reduce contention.

The warm_recall_cache_lock is held while score_warm_memories iterates and scores all warm memories (lines 70-76). Since warm_memories is already an Arc snapshot and inflight_forget_ids is already collected, the scoring could run outside the lock:

♻️ Suggested refactor to minimize lock duration
     async fn warm_cache_results(
         &self,
         query: &str,
         memory_type: Option<MemoryType>,
         max_results: usize,
     ) -> Vec<MemorySearchResult> {
         let Some(runtime_config) = self.runtime_config.as_ref() else {
             return Vec::new();
         };
-        let _warm_recall_cache_guard = runtime_config.warm_recall_cache_lock.lock().await;
 
-        let now_unix_ms = chrono::Utc::now().timestamp_millis();
-        let refreshed_at_unix_ms = *runtime_config
-            .warm_recall_refreshed_at_unix_ms
-            .load()
-            .as_ref();
-        let Some(age_secs) = warm_cache_age_secs(refreshed_at_unix_ms, now_unix_ms) else {
-            return Vec::new();
-        };
-        let warmup_refresh_secs = runtime_config.warmup.load().as_ref().refresh_secs.max(1);
-        if age_secs > warmup_refresh_secs {
-            return Vec::new();
-        }
+        let (warm_memories, inflight_forget_ids) = {
+            let _guard = runtime_config.warm_recall_cache_lock.lock().await;
 
-        let warm_memories = runtime_config.warm_recall_memories.load();
-        let inflight_forget_ids = snapshot_inflight_forget_ids(runtime_config);
+            let now_unix_ms = chrono::Utc::now().timestamp_millis();
+            let refreshed_at_unix_ms = *runtime_config
+                .warm_recall_refreshed_at_unix_ms
+                .load()
+                .as_ref();
+            let Some(age_secs) = warm_cache_age_secs(refreshed_at_unix_ms, now_unix_ms) else {
+                return Vec::new();
+            };
+            let warmup_refresh_secs = runtime_config.warmup.load().as_ref().refresh_secs.max(1);
+            if age_secs > warmup_refresh_secs {
+                return Vec::new();
+            }
+
+            (
+                runtime_config.warm_recall_memories.load(),
+                snapshot_inflight_forget_ids(runtime_config),
+            )
+        };
+
         score_warm_memories(
             query,
             warm_memories.as_ref(),
             memory_type,
             max_results,
             &inflight_forget_ids,
         )
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/tools/memory_recall.rs` around lines 44 - 77, The cache lock in
warm_cache_results is held while calling score_warm_memories, causing
unnecessary contention; instead, capture the needed snapshot under the lock
(runtime_config.warm_recall_memories.load() into warm_memories and compute
inflight_forget_ids via snapshot_inflight_forget_ids(runtime_config)), then drop
the lock before calling score_warm_memories(query, warm_memories.as_ref(),
memory_type, max_results, &inflight_forget_ids). Update warm_cache_results to
acquire the warm_recall_cache_lock only for snapshotting and release it prior to
scoring so scoring runs on the copied warm_memories and inflight_forget_ids
without holding runtime_config.warm_recall_cache_lock.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/tools/memory_delete.rs`:
- Around line 250-261: When store.forget() returns false (a concurrent-forget
race) treat it as already-forgotten rather than a hard failure: still acquire or
respect the warm_recall_cache_lock and call
MemoryDeleteTool::evict_from_warm_cache(&runtime_config, &memory_id) and bump
runtime_config.warm_recall_cache_epoch.fetch_add(1, Ordering::SeqCst) so
degraded/fallback entries are evicted; ensure inflight_forget_guard is dropped
where appropriate (or dropped after evicting) to avoid deadlocks. Locate the
logic around was_forgotten, inflight_forget_guard, runtime_config and
MemoryDeleteTool::evict_from_warm_cache (also mirror the same change in the
similar block at lines ~301-305) and change the false-path to perform eviction
and epoch increment instead of treating it as a hard error.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 11de638 and 820e2ee.

📒 Files selected for processing (7)
  • src/agent/cortex.rs
  • src/api/agents.rs
  • src/config.rs
  • src/main.rs
  • src/tools.rs
  • src/tools/memory_delete.rs
  • src/tools/memory_recall.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/api/agents.rs

Comment on lines +250 to +261
    let mut removed_from_warm_cache = false;
    if was_forgotten {
        let _warm_recall_cache_guard =
            runtime_config.warm_recall_cache_lock.lock().await;
        removed_from_warm_cache =
            MemoryDeleteTool::evict_from_warm_cache(&runtime_config, &memory_id);
        runtime_config
            .warm_recall_cache_epoch
            .fetch_add(1, Ordering::SeqCst);
    } else {
        drop(inflight_forget_guard);
    }

⚠️ Potential issue | 🟡 Minor

Handle concurrent-forget races as already-forgotten and still evict warm cache.

If store.forget() returns false after the earlier successful load, the likely cause is a concurrent-forget race. The current path reports a hard failure and skips warm-cache eviction, which can leave stale degraded-fallback entries in the cache on racy paths.

🔧 Proposed adjustment
                     let mut removed_from_warm_cache = false;
                     if was_forgotten {
                         let _warm_recall_cache_guard =
                             runtime_config.warm_recall_cache_lock.lock().await;
                         removed_from_warm_cache =
                             MemoryDeleteTool::evict_from_warm_cache(&runtime_config, &memory_id);
                         runtime_config
                             .warm_recall_cache_epoch
                             .fetch_add(1, Ordering::SeqCst);
                     } else {
+                        let _warm_recall_cache_guard =
+                            runtime_config.warm_recall_cache_lock.lock().await;
+                        removed_from_warm_cache =
+                            MemoryDeleteTool::evict_from_warm_cache(&runtime_config, &memory_id);
+                        if removed_from_warm_cache {
+                            runtime_config
+                                .warm_recall_cache_epoch
+                                .fetch_add(1, Ordering::SeqCst);
+                        }
                         drop(inflight_forget_guard);
                     }
@@
-        } else {
+        } else {
             Ok(MemoryDeleteOutput {
                 forgotten: false,
-                message: format!("Failed to forget memory {}.", args.memory_id),
+                message: format!(
+                    "Memory {} was not newly forgotten (likely already forgotten). Warm cache evicted: {}.",
+                    args.memory_id, removed_from_warm_cache
+                ),
             })
         }

Also applies to: 301-305
