Conversation
The resolve eval was using recall-only scoring, which didn't penalize spurious extra values. Switch to F1 (harmonic mean of precision and recall) so both missing and extra values are caught. Update expected outputs to match the agent's correct behavior:

- Consent code cases now use dynamic expectations computed via `compute_eligible_codes`, so they auto-update with catalog data
- Measurement/focus cases updated to include the full set of related concepts the agent returns (e.g. "blood pressure" → Systolic + Diastolic)
- Remove duplicate eval cases (same input, different expectations)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
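The F1 metric the commit switches to can be sketched over value sets (the function name is illustrative, not the eval's actual helper):

```python
def f1_score(expected: set[str], actual: set[str]) -> float:
    """Harmonic mean of precision and recall over two value sets.

    Recall-only scoring ignores spurious extras; F1 penalizes both
    missing values (low recall) and extra values (low precision).
    """
    if not expected and not actual:
        return 1.0
    tp = len(expected & actual)  # true positives
    if tp == 0:
        return 0.0
    precision = tp / len(actual)
    recall = tp / len(expected)
    return 2 * precision * recall / (precision + recall)
```

Under recall-only scoring, `{"a", "b", "c"}` against an expected `{"a", "b"}` would score a perfect 1.0; F1 drops it to 0.8 because the spurious `"c"` hurts precision.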
Deterministic outputs are needed for the resolve cache (same input → same result). Extract agent had no explicit temperature; resolve agent was at 0.2. Both now pinned to 0.0. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cache resolved mentions by (facet, normalized_text) to eliminate redundant LLM calls for repeated queries. Features:

- LRU eviction at 10k entries (configurable via `RESOLVE_CACHE_MAX_SIZE`)
- 24h TTL (configurable via `RESOLVE_CACHE_TTL_SECONDS`)
- In-flight deduplication via `asyncio.Event` — concurrent resolves for the same key share a single LLM call
- Cache stats exposed in `/health` endpoint (`resolveCache`)
- Clears on process restart (`make restart` / `make db-reload`)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
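The LRU + TTL behavior described above can be sketched minimally (synchronous, without the locking, stats, and in-flight deduplication; class and method names are illustrative, not the PR's API):

```python
import time
from collections import OrderedDict


class TTLLRUCache:
    """Minimal sketch of an LRU cache with TTL, keyed by e.g.
    (facet, normalized_text). Not the PR's actual implementation."""

    def __init__(self, max_size: int = 10_000, ttl_seconds: float = 86_400.0):
        self.max_size = max_size
        self.ttl_seconds = ttl_seconds
        self._cache: OrderedDict = OrderedDict()  # key -> (created, value)

    def get(self, key):
        entry = self._cache.get(key)
        if entry is None:
            return None
        created, value = entry
        if time.monotonic() - created >= self.ttl_seconds:
            del self._cache[key]  # expired: drop and treat as a miss
            return None
        self._cache.move_to_end(key)  # refresh LRU position on hit
        return value

    def put(self, key, value):
        if key in self._cache:
            self._cache.move_to_end(key)
        elif len(self._cache) >= self.max_size:
            self._cache.popitem(last=False)  # evict least-recently-used
        self._cache[key] = (time.monotonic(), value)
```

Because hits call `move_to_end`, a recently read key survives eviction while an untouched one is dropped first.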
Pull request overview
This PR implements an in-memory LRU cache for the resolve agent to eliminate redundant LLM calls, switches eval scoring from recall-only to F1 metric, and sets temperature=0.0 on all agents for deterministic outputs.
Changes:
- Added LRU cache with TTL and in-flight deduplication for resolve agent results, reducing redundant LLM calls from 2-8s to <1ms for cache hits
- Changed resolve eval scoring from recall-only to F1 to penalize both missing and spurious values, and updated test expectations to use dynamic consent code computation
- Set temperature=0.0 on extract and resolve agents for deterministic outputs (structure agent already had this setting)
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| backend/concept_search/resolve_agent.py | Implements LRU cache with TTL, in-flight deduplication, stats tracking, and refactors run_resolve to use caching |
| backend/tests/test_resolve_cache.py | Comprehensive test suite covering cache hits, misses, TTL, LRU eviction, normalization, deduplication, and stats |
| backend/concept_search/extract_agent.py | Sets temperature=0.0 for deterministic outputs |
| backend/concept_search/eval_resolve.py | Changes scoring to F1, adds dynamic consent code expectations via _consent_expected helper, updates measurement/focus test expectations |
| backend/concept_search/api.py | Exposes cache stats in /health endpoint |
Comments suppressed due to low confidence (3)
backend/concept_search/resolve_agent.py:96
- Memory leak potential: Expired cache entries are only removed when they are accessed (line 91 check) or when the cache reaches max_size (line 136-138). This means expired entries can accumulate and consume memory until they are either accessed or evicted by LRU. Consider implementing a periodic cleanup task or proactive expiration during cache operations to prevent unbounded memory growth when many unique keys are cached and then abandoned.
```python
async with self._lock:
    # 1. Cache hit?
    entry = self._cache.get(key)
    if entry and (time.monotonic() - entry.created) < self.ttl_seconds:
        self.hits += 1
        # Move to end for LRU ordering
        self._cache[key] = self._cache.pop(key)
        logger.info("resolve_cache hit key=%s", key)
        return entry.result
```
backend/concept_search/resolve_agent.py:167
- Race condition: The stats property reads hits, misses, and len(self._cache) without holding the lock. In a multi-threaded environment, these values could be inconsistent (e.g., hits and misses could change between reads). Consider acquiring the lock when reading these values to ensure atomic reads, or use atomic operations.
```python
@property
def stats(self) -> dict:
    """Return cache statistics."""
    total = self.hits + self.misses
    return {
        "hit_rate": round(self.hits / total, 3) if total else 0,
        "hits": self.hits,
        "misses": self.misses,
        "size": len(self._cache),
    }
```
backend/concept_search/resolve_agent.py:143
- Error handling issue: If _run_resolve_uncached raises an exception, the result is never cached. However, the in-flight event is properly cleaned up in the finally block. Consider whether certain types of errors (e.g., transient LLM errors vs permanent validation errors) should be cached differently to avoid repeated failed attempts for the same mention.
```python
try:
    result = await _run_resolve_uncached(mention, index, model)
finally:
    async with self._lock:
        ev = self._in_flight.pop(key, None)
        if ev is not None:
            ev.set()
# Store result
async with self._lock:
    if len(self._cache) >= self.max_size:
        oldest = next(iter(self._cache))
        del self._cache[oldest]
    self._cache[key] = _CacheEntry(
        created=time.monotonic(), result=result
    )
return result
```
- Extract shared `LRUCache[K, V]` into `cache.py` — used by both resolve and pipeline caches, eliminating duplication
- Add pipeline-level cache keyed on normalized query string — repeated queries skip all three agents (3.6s → 8ms)
- `clear_all()` clears every registered cache instance at once
- Both caches report stats in `/health` (`pipelineCache`, `resolveCache`)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
```python
self.misses += 1
logger.info("%s miss key=%s", self.name, key)
try:
```
logger.info("%s miss key=%s", ...) has the same issue as the hit log: it records the raw cache key (including user query/mention text) at INFO level, which is both sensitive and high-volume. Consider removing it, switching to DEBUG, and/or logging only anonymized key info.
```python
key: K,
compute: asyncio.coroutines,
) -> V:
```
The compute parameter type annotation is incorrect: asyncio.coroutines is not a useful callable/awaitable type here and won’t type-check. Consider typing this as an async callable (e.g., Callable[[], Awaitable[V]]) so callers and static type checkers have the right contract.
```python
key = query.strip().lower()
return await pipeline_cache.get_or_compute(
    key, lambda: _run_pipeline_uncached(query, index, model)
)
```
run_pipeline caches solely by normalized query text, but the function signature allows overriding index and model. If either differs between calls, this cache can return results produced with a different model or against a different index than requested. Consider including model (and some stable index identifier/version) in the cache key, or bypass caching when overrides are provided.
```python
key = (mention.facet.value, mention.text.strip().lower())
return await resolve_cache.get_or_compute(
    key, lambda: _run_resolve_uncached(mention, index, model)
)
```
run_resolve caches by (facet, normalized_text) but ignores the model argument and the index dependency. If callers override model (or pass a different ConceptIndex), this can return cached results computed with different inputs than requested. Consider incorporating model (and optionally an index version/id) into the key, or disabling caching when overrides are used.
backend/concept_search/cache.py (outdated)
```python
try:
    value = await compute()
finally:
    async with self._lock:
        ev = self._in_flight.pop(key, None)
        if ev is not None:
            ev.set()

async with self._lock:
    if len(self._cache) >= self.max_size:
        oldest = next(iter(self._cache))
        del self._cache[oldest]
    self._cache[key] = _CacheEntry(
        created=time.monotonic(), value=value
    )
```
In-flight deduplication is currently racy: the in-flight Event is set in the finally block before the computed value is inserted into _cache. Waiters can wake up, not find the entry yet, and then proceed to compute again (breaking the “run only once” guarantee and potentially causing stampedes). Consider storing the result (or exception) and updating _cache while still holding the lock, and only then setting the event (or use an asyncio.Future/Task per key to propagate result/exception).
Suggested change:

```diff
-try:
-    value = await compute()
-finally:
-    async with self._lock:
-        ev = self._in_flight.pop(key, None)
-        if ev is not None:
-            ev.set()
-async with self._lock:
-    if len(self._cache) >= self.max_size:
-        oldest = next(iter(self._cache))
-        del self._cache[oldest]
-    self._cache[key] = _CacheEntry(
-        created=time.monotonic(), value=value
-    )
+success = False
+try:
+    value = await compute()
+    success = True
+finally:
+    async with self._lock:
+        if success:
+            if len(self._cache) >= self.max_size:
+                oldest = next(iter(self._cache))
+                del self._cache[oldest]
+            self._cache[key] = _CacheEntry(
+                created=time.monotonic(), value=value
+            )
+        ev = self._in_flight.pop(key, None)
+        if ev is not None:
+            ev.set()
```
```python
self.hits += 1
self._cache[key] = self._cache.pop(key)
logger.info("%s hit key=%s", self.name, key)
return entry.value
```
logger.info("%s hit key=%s", ...) logs the full cache key, which for pipeline_cache/resolve_cache includes user-provided query/mention text. Since logs go to CloudWatch at INFO level, this can leak potentially sensitive user input and will also be very noisy in a hot path. Consider removing this log, downgrading to DEBUG, and/or logging only non-sensitive metadata (e.g., cache name + a hash of the key).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix race: store result in cache before setting in-flight event, so waiters find the entry immediately on wakeup
- Add TTL check on waiter path to avoid returning expired entries
- Downgrade cache hit/miss logs from INFO to DEBUG (hot-path noise)
- Add exception handling tests: failed computes are not cached, and retries after failure work correctly

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
backend/tests/test_cache.py:97
- The lambda closures `lambda: _compute(k)` in the loop capture the loop variable `k` by reference. While this works correctly in the current implementation, because the lambda is called immediately on a cache miss (before the next iteration), the pattern is fragile and could lead to bugs if the cache implementation changes.
Consider using a default argument to capture the value, `lambda k=k: _compute(k)`, or a helper function that doesn't rely on closing over the loop variable. This makes the intent clearer and prevents potential future bugs.
```python
for k in ["a", "b", "c"]:
    await cache.get_or_compute(k, lambda: _compute(k))
assert len(cache._cache) == 3
await cache.get_or_compute("d", lambda: _compute("d"))
assert len(cache._cache) == 3
assert "a" not in cache._cache


@pytest.mark.asyncio()
async def test_lru_access_refreshes(cache: LRUCache[str, str]) -> None:
    """Accessing an entry should move it to the end (most recent)."""
    for k in ["a", "b", "c"]:
        await cache.get_or_compute(k, lambda: _compute(k))
```
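The late-binding pitfall and the default-argument fix can be demonstrated in isolation (a hypothetical demo, not part of the PR's test suite):

```python
def late_binding_demo() -> tuple[list[str], list[str]]:
    """All deferred by-reference lambdas see the loop variable's final
    value; a default argument freezes each iteration's value instead."""
    by_ref = [lambda: k for k in ["a", "b", "c"]]       # late binding
    by_val = [lambda k=k: k for k in ["a", "b", "c"]]   # captured per iteration
    return [f() for f in by_ref], [f() for f in by_val]
```

The danger only materializes when the lambda outlives the iteration, which is exactly what would happen if the cache ever deferred `compute` (e.g. to a background task) instead of awaiting it immediately.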
```python
key = (mention.facet.value, mention.text.strip().lower())
return await resolve_cache.get_or_compute(
    key, lambda: _run_resolve_uncached(mention, index, model)
)
```
The cache key for resolve only includes (facet, normalized_text), but doesn't include the model parameter. This means that if the same mention is resolved with different models, the second call will return cached results from the first model instead of actually using the specified model.
While the API and typical usage may not vary the model parameter, the function signature accepts it, and the resolve agent is called from the pipeline which also accepts a model parameter. This could lead to unexpected behavior if different models are used for the same mention.
Consider including the model in the cache key, or documenting that the cache is model-agnostic.
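A hypothetical key builder along the suggested lines (the names and the `index_version` parameter are illustrative, not the PR's API):

```python
def resolve_cache_key(
    facet: str, text: str, model: str, index_version: str
) -> tuple[str, str, str, str]:
    """Build a cache key that includes every input the resolution depends
    on: facet, normalized mention text, the model identifier, and a stable
    index version, so overrides never hit entries computed differently."""
    return (facet, text.strip().lower(), model, index_version)
```

With the model and index version folded in, two calls that differ only in `model` produce distinct keys and therefore distinct cache entries.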
```python
@property
def stats(self) -> dict:
    """Return cache statistics."""
    total = self.hits + self.misses
    return {
        "hit_rate": round(self.hits / total, 3) if total else 0,
        "hits": self.hits,
        "misses": self.misses,
        "size": len(self._cache),
    }
```
The stats property reads self.hits, self.misses, and len(self._cache) without acquiring self._lock. This could lead to race conditions where the stats are inconsistent or stale.
For example, if one coroutine is updating the cache while another calls stats, the hit rate calculation could be based on mismatched values of hits and misses, or the size could be inconsistent with the hit/miss counts.
Consider either:
- Acquiring the lock in the stats property: `async with self._lock: return {...}`
- Documenting that stats are eventually consistent and may be slightly stale
- Using atomic operations or a separate lock for stats if performance is a concern

Note that making this an async property would require changing all call sites to await it.
```python
key = query.strip().lower()
return await pipeline_cache.get_or_compute(
    key, lambda: _run_pipeline_uncached(query, index, model)
```
The cache key for the pipeline only includes the normalized query string, but doesn't include the model parameter. This means that if the same query is run with different models (e.g., via the CLI with --model), the second call will return cached results from the first model instead of actually using the specified model.
For example, if a user runs `python -m concept_search.cli "diabetes" --model anthropic:claude-sonnet-4` and then runs `python -m concept_search.cli "diabetes" --model anthropic:claude-haiku-4`, the second call will return the cached result from Sonnet instead of re-running with Haiku.
Consider including the model in the cache key, or documenting that the cache is model-agnostic (though this seems unlikely to be the intended behavior).
Summary

- Added an in-memory LRU cache for the resolve agent so a repeated `(facet, text)` pair is resolved only once. Cache hits return in <1ms vs 2-8s for an LLM call.
- Switched resolve eval scoring from recall-only to F1, and replaced hardcoded consent-code expectations with a `_consent_expected()` helper that stays in sync with catalog data. Fixed measurement/focus expectations from actual agent output.

Cache details

- LRU eviction at 10k entries (configurable via `RESOLVE_CACHE_MAX_SIZE`)
- 24h TTL (configurable via `RESOLVE_CACHE_TTL_SECONDS`)
- In-flight deduplication via `asyncio.Event`
- Cache stats exposed in the `/health` endpoint (`resolveCache: {hits, misses, hit_rate, size}`)
- Clears on process restart (`make restart` / `make db-reload`)

Test plan
🤖 Generated with Claude Code