Harden warm-recall degraded fallback and cache consistency#257
vsumner wants to merge 2 commits into spacedriveapp:main
Conversation
Walkthrough

Adds a runtime warm-recall memory cache and lifecycle: new RuntimeConfig fields, warm-recall refresh and eviction logic, warmup telemetry reporting warm_recall_count, and threads RuntimeConfig into memory tools and tool-server creation for runtime-aware memory operations.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/tools.rs`:
- Around lines 468-473: Replace the non-runtime-aware memory recall
instantiation MemoryRecallTool::new(memory_search.clone()) with the
runtime-aware constructor MemoryRecallTool::with_runtime(memory_search,
runtime_config.clone()) (matching how MemoryDeleteTool is created) so memory
recall honors warm-cache degraded fallback behavior. Pass the correct
cloned/owned memory_search and a cloned runtime_config to the with_runtime
call.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- src/agent/cortex.rs
- src/api/agents.rs
- src/config.rs
- src/main.rs
- src/tools.rs
- src/tools/memory_delete.rs
- src/tools/memory_recall.rs
11de638 to 820e2ee
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/tools/memory_recall.rs (1)
44-77: Consider releasing the cache lock before scoring to reduce contention.

The warm_recall_cache_lock is held while score_warm_memories iterates and scores all warm memories (lines 70-76). Since warm_memories is already an Arc snapshot and inflight_forget_ids is already collected, the scoring could run outside the lock:

♻️ Suggested refactor to minimize lock duration
```diff
 async fn warm_cache_results(
     &self,
     query: &str,
     memory_type: Option<MemoryType>,
     max_results: usize,
 ) -> Vec<MemorySearchResult> {
     let Some(runtime_config) = self.runtime_config.as_ref() else {
         return Vec::new();
     };
-    let _warm_recall_cache_guard = runtime_config.warm_recall_cache_lock.lock().await;
-    let now_unix_ms = chrono::Utc::now().timestamp_millis();
-    let refreshed_at_unix_ms = *runtime_config
-        .warm_recall_refreshed_at_unix_ms
-        .load()
-        .as_ref();
-    let Some(age_secs) = warm_cache_age_secs(refreshed_at_unix_ms, now_unix_ms) else {
-        return Vec::new();
-    };
-    let warmup_refresh_secs = runtime_config.warmup.load().as_ref().refresh_secs.max(1);
-    if age_secs > warmup_refresh_secs {
-        return Vec::new();
-    }
+    let (warm_memories, inflight_forget_ids) = {
+        let _guard = runtime_config.warm_recall_cache_lock.lock().await;
-    let warm_memories = runtime_config.warm_recall_memories.load();
-    let inflight_forget_ids = snapshot_inflight_forget_ids(runtime_config);
+        let now_unix_ms = chrono::Utc::now().timestamp_millis();
+        let refreshed_at_unix_ms = *runtime_config
+            .warm_recall_refreshed_at_unix_ms
+            .load()
+            .as_ref();
+        let Some(age_secs) = warm_cache_age_secs(refreshed_at_unix_ms, now_unix_ms) else {
+            return Vec::new();
+        };
+        let warmup_refresh_secs = runtime_config.warmup.load().as_ref().refresh_secs.max(1);
+        if age_secs > warmup_refresh_secs {
+            return Vec::new();
+        }
+
+        (
+            runtime_config.warm_recall_memories.load(),
+            snapshot_inflight_forget_ids(runtime_config),
+        )
+    };
+
     score_warm_memories(
         query,
         warm_memories.as_ref(),
         memory_type,
         max_results,
         &inflight_forget_ids,
     )
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/tools/memory_recall.rs` around lines 44 - 77, The cache lock in warm_cache_results is held while calling score_warm_memories, causing unnecessary contention; instead, capture the needed snapshot under the lock (runtime_config.warm_recall_memories.load() into warm_memories and compute inflight_forget_ids via snapshot_inflight_forget_ids(runtime_config)), then drop the lock before calling score_warm_memories(query, warm_memories.as_ref(), memory_type, max_results, &inflight_forget_ids). Update warm_cache_results to acquire the warm_recall_cache_lock only for snapshotting and release it prior to scoring so scoring runs on the copied warm_memories and inflight_forget_ids without holding runtime_config.warm_recall_cache_lock.
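The snapshot-then-score pattern the refactor suggests can be shown in isolation. This is a simplified std-only sketch, not the PR's code: the real implementation uses tokio's async Mutex and ArcSwap, and `WarmCache`, `score`, and the plain-String memories here are illustrative stand-ins.

```rust
use std::sync::{Arc, Mutex};

// Stand-in for the runtime cache: the Mutex guards the snapshot pointer only.
struct WarmCache {
    memories: Mutex<Arc<Vec<String>>>,
}

// Illustrative scoring pass: substring match, capped at max_results.
fn score(query: &str, memories: &[String], max_results: usize) -> Vec<String> {
    memories
        .iter()
        .filter(|m| m.contains(query))
        .take(max_results)
        .cloned()
        .collect()
}

fn warm_cache_results(cache: &WarmCache, query: &str, max_results: usize) -> Vec<String> {
    // Hold the lock only long enough to clone the Arc pointer...
    let snapshot = { Arc::clone(&cache.memories.lock().unwrap()) };
    // ...then run the (potentially slow) scoring pass without blocking
    // concurrent refresh or eviction.
    score(query, &snapshot, max_results)
}
```

Because the `Arc` clone is cheap and the scoring only reads the snapshot, concurrent evictions can proceed while scoring runs; they swap in a new `Arc` rather than mutating the one being scored.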
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/tools/memory_delete.rs`:
- Around line 250-261: When store.forget() returns false (a concurrent-forget
race) treat it as already-forgotten rather than a hard failure: still acquire or
respect the warm_recall_cache_lock and call
MemoryDeleteTool::evict_from_warm_cache(&runtime_config, &memory_id) and bump
runtime_config.warm_recall_cache_epoch.fetch_add(1, Ordering::SeqCst) so
degraded/fallback entries are evicted; ensure inflight_forget_guard is dropped
where appropriate (or dropped after evicting) to avoid deadlocks. Locate the
logic around was_forgotten, inflight_forget_guard, runtime_config and
MemoryDeleteTool::evict_from_warm_cache (also mirror the same change in the
similar block at lines ~301-305) and change the false-path to perform eviction
and epoch increment instead of treating it as a hard error.
---
Nitpick comments:
In `@src/tools/memory_recall.rs`:
- Around line 44-77: The cache lock in warm_cache_results is held while calling
score_warm_memories, causing unnecessary contention; instead, capture the needed
snapshot under the lock (runtime_config.warm_recall_memories.load() into
warm_memories and compute inflight_forget_ids via
snapshot_inflight_forget_ids(runtime_config)), then drop the lock before calling
score_warm_memories(query, warm_memories.as_ref(), memory_type, max_results,
&inflight_forget_ids). Update warm_cache_results to acquire the
warm_recall_cache_lock only for snapshotting and release it prior to scoring so
scoring runs on the copied warm_memories and inflight_forget_ids without holding
runtime_config.warm_recall_cache_lock.
ℹ️ Review info
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- src/agent/cortex.rs
- src/api/agents.rs
- src/config.rs
- src/main.rs
- src/tools.rs
- src/tools/memory_delete.rs
- src/tools/memory_recall.rs
🚧 Files skipped from review as they are similar to previous changes (1)
- src/api/agents.rs
```rust
let mut removed_from_warm_cache = false;
if was_forgotten {
    let _warm_recall_cache_guard =
        runtime_config.warm_recall_cache_lock.lock().await;
    removed_from_warm_cache =
        MemoryDeleteTool::evict_from_warm_cache(&runtime_config, &memory_id);
    runtime_config
        .warm_recall_cache_epoch
        .fetch_add(1, Ordering::SeqCst);
} else {
    drop(inflight_forget_guard);
}
```
Handle concurrent-forget races as already-forgotten and still evict warm cache.
If store.forget() returns false after the earlier successful load, this is often a concurrent-forget race. The current path reports a hard failure and skips warm-cache eviction, which can leave stale degraded-fallback entries behind on racy paths.
🔧 Proposed adjustment
```diff
 let mut removed_from_warm_cache = false;
 if was_forgotten {
     let _warm_recall_cache_guard =
         runtime_config.warm_recall_cache_lock.lock().await;
     removed_from_warm_cache =
         MemoryDeleteTool::evict_from_warm_cache(&runtime_config, &memory_id);
     runtime_config
         .warm_recall_cache_epoch
         .fetch_add(1, Ordering::SeqCst);
 } else {
+    let _warm_recall_cache_guard =
+        runtime_config.warm_recall_cache_lock.lock().await;
+    removed_from_warm_cache =
+        MemoryDeleteTool::evict_from_warm_cache(&runtime_config, &memory_id);
+    if removed_from_warm_cache {
+        runtime_config
+            .warm_recall_cache_epoch
+            .fetch_add(1, Ordering::SeqCst);
+    }
     drop(inflight_forget_guard);
 }
@@
-} else {
+} else {
     Ok(MemoryDeleteOutput {
         forgotten: false,
-        message: format!("Failed to forget memory {}.", args.memory_id),
+        message: format!(
+            "Memory {} was not newly forgotten (likely already forgotten). Warm cache evicted: {}.",
+            args.memory_id, removed_from_warm_cache
+        ),
     })
 }
```

Also applies to: 301-305
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/tools/memory_delete.rs` around lines 250 - 261, When store.forget()
returns false (a concurrent-forget race) treat it as already-forgotten rather
than a hard failure: still acquire or respect the warm_recall_cache_lock and
call MemoryDeleteTool::evict_from_warm_cache(&runtime_config, &memory_id) and
bump runtime_config.warm_recall_cache_epoch.fetch_add(1, Ordering::SeqCst) so
degraded/fallback entries are evicted; ensure inflight_forget_guard is dropped
where appropriate (or dropped after evicting) to avoid deadlocks. Locate the
logic around was_forgotten, inflight_forget_guard, runtime_config and
MemoryDeleteTool::evict_from_warm_cache (also mirror the same change in the
similar block at lines ~301-305) and change the false-path to perform eviction
and epoch increment instead of treating it as a hard error.
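The essence of the proposed fix is small: treat a false return from the store as already-forgotten, evict from the warm cache anyway, and bump the epoch only when an entry was actually removed. A minimal std-only sketch of that false-path handling, with a plain HashMap cache standing in for the real warm-cache structure (only `warm_recall_cache_epoch` is named in the review; the rest is illustrative):

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

// Illustrative runtime state for this sketch.
struct Runtime {
    warm_cache: Mutex<HashMap<String, String>>, // memory_id -> content
    warm_recall_cache_epoch: AtomicU64,
}

// Called on the path where store.forget() returned false: still evict any
// stale warm-cache entry, and bump the epoch only if one was present.
fn handle_already_forgotten(rt: &Runtime, memory_id: &str) -> bool {
    let removed = rt.warm_cache.lock().unwrap().remove(memory_id).is_some();
    if removed {
        rt.warm_recall_cache_epoch.fetch_add(1, Ordering::SeqCst);
    }
    removed
}
```

Conditioning the epoch bump on an actual removal keeps the operation idempotent: a second concurrent forget of the same ID finds nothing to evict and leaves the epoch untouched.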
Problem
Warm recall fallback had a few correctness/resiliency gaps under degraded search/store conditions:
Approach
This PR hardens warm-recall behavior with runtime-coordinated cache state and explicit failure signaling.
1) Runtime warm-cache coordination
New RuntimeConfig state: warm_recall_memories, warm_recall_refreshed_at_unix_ms, warm_recall_cache_lock, warm_recall_cache_epoch, warm_recall_inflight_forget_counts (refcounted in-flight forget IDs).

2) Hybrid fallback behavior
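In rough shape, the behavior in this section amounts to the following. This is a hedged sketch: only the `degraded_fallback_used` / `degraded_fallback_error` names come from the PR; `RecallOutput`, `recall`, and the String results are illustrative stand-ins.

```rust
// Illustrative output shape for explicit failure signaling.
#[derive(Debug, PartialEq)]
struct RecallOutput {
    results: Vec<String>,
    degraded_fallback_used: bool,
    degraded_fallback_error: Option<String>,
}

// Serve live search results when healthy; on error, fall back to the warm
// snapshot and record both that the fallback fired and why.
fn recall(live: Result<Vec<String>, String>, warm: Vec<String>) -> RecallOutput {
    match live {
        Ok(results) => RecallOutput {
            results,
            degraded_fallback_used: false,
            degraded_fallback_error: None,
        },
        Err(err) => RecallOutput {
            results: warm,
            degraded_fallback_used: true,
            degraded_fallback_error: Some(err),
        },
    }
}
```

Surfacing both fields lets callers distinguish "fresh results" from "best-effort warm results" instead of silently degrading.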
memory_recall now falls back to warm-cache results when search is degraded and signals it explicitly (degraded_fallback_used, degraded_fallback_error).

3) Delete path consistency/cancellation safety
memory_delete now returns an accurate forgotten flag and holds an InflightForgetGuard to exclude IDs while a forget is in progress.

4) Warmup refresh race hardening
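The PR text gives no detail for this item, but given the warm_recall_cache_epoch counter above, one plausible pattern is: record the epoch before a refresh rebuild, and skip publishing if a concurrent eviction bumped it in the meantime. A hedged std-only sketch of that idea (an assumption, not the PR's actual code):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

// Stand-in cache: `epoch` is bumped by every eviction; a refresh snapshots it
// up front and abandons its result if it moved during the rebuild.
struct Cache {
    epoch: AtomicU64,
    snapshot: Mutex<Vec<String>>,
}

fn try_publish(cache: &Cache, epoch_at_start: u64, rebuilt: Vec<String>) -> bool {
    let mut snap = cache.snapshot.lock().unwrap();
    if cache.epoch.load(Ordering::SeqCst) != epoch_at_start {
        return false; // a concurrent forget invalidated this rebuild; drop it
    }
    *snap = rebuilt;
    true
}
```

The check runs under the snapshot lock so an eviction cannot interleave between the epoch comparison and the publish.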
5) Wiring
RuntimeConfig is threaded into MemoryRecallTool / MemoryDeleteTool.

Why this matches architecture
Files changed
- src/config.rs
- src/tools/memory_recall.rs
- src/tools/memory_delete.rs
- src/agent/cortex.rs
- src/tools.rs
- src/main.rs
- src/api/agents.rs

Verification
Targeted:
- cargo test --lib tools::memory_recall
- cargo test --lib tools::memory_delete
- cargo test --lib agent::cortex

Repo gates:
- just preflight
- just gate-pr

All passed.
Review closure
Addressed external review findings around:
Final review rounds: