feat: resolve agent result cache (#200) #202

Merged
NoopDog merged 6 commits into main from noopdog/200/resolve-cache
Feb 22, 2026

Conversation

NoopDog (Contributor) commented Feb 22, 2026

Summary

  • In-memory LRU cache for resolve agent results — eliminates redundant LLM calls when the same (facet, text) pair is resolved multiple times. Cache hits return in <1ms vs 2-8s for an LLM call.
  • Fix resolve eval scoring — switched from recall-only to F1, which penalizes both missing and spurious values. Updated consent code expectations to use a dynamic _consent_expected() helper that stays in sync with catalog data, and corrected measurement/focus expectations to match actual agent output.
  • Set temperature=0.0 on extract and resolve agents for deterministic outputs (structure agent was already at 0.0).
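
The recall-to-F1 switch described above can be sketched as set-based scoring; the function name and signature below are illustrative, not the eval harness's actual code:

```python
def f1_score(expected: set[str], predicted: set[str]) -> float:
    """Set-based F1: harmonic mean of precision and recall.

    Recall-only scoring ignores spurious predictions; F1 penalizes
    both missing values (low recall) and extra values (low precision).
    """
    if not expected and not predicted:
        return 1.0  # both empty: perfect agreement
    tp = len(expected & predicted)  # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(expected) if expected else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, predicting one spurious value on top of two correct ones scores precision 2/3 and recall 1.0, giving F1 0.8 where recall-only scoring would have reported a perfect 1.0.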

Cache details

  • LRU eviction at 10,000 entries (configurable via RESOLVE_CACHE_MAX_SIZE)
  • 24h TTL (configurable via RESOLVE_CACHE_TTL_SECONDS)
  • In-flight deduplication — concurrent resolves for the same key share a single LLM call via asyncio.Event
  • Stats exposed in /health endpoint (resolveCache: {hits, misses, hit_rate, size})
  • Clears on process restart (make restart / make db-reload)

Test plan

  • 119 unit tests pass (9 new cache tests + 110 existing)
  • Extract evals: 0.976 (baseline)
  • Resolve evals: 0.979 F1 (42/44 at 1.0)

🤖 Generated with Claude Code

NoopDog and others added 3 commits February 22, 2026 00:05
The resolve eval was using recall-only scoring, which didn't penalize
spurious extra values. Switch to F1 (harmonic mean of precision and
recall) so both missing and extra values are caught.

Update expected outputs to match the agent's correct behavior:
- Consent code cases now use dynamic expectations computed via
  compute_eligible_codes, so they auto-update with catalog data
- Measurement/focus cases updated to include the full set of
  related concepts the agent returns (e.g. "blood pressure" →
  Systolic + Diastolic)
- Remove duplicate eval cases (same input, different expectations)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Deterministic outputs are needed for the resolve cache (same input →
same result). Extract agent had no explicit temperature; resolve agent
was at 0.2. Both now pinned to 0.0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cache resolved mentions by (facet, normalized_text) to eliminate
redundant LLM calls for repeated queries. Features:
- LRU eviction at 10k entries (configurable via RESOLVE_CACHE_MAX_SIZE)
- 24h TTL (configurable via RESOLVE_CACHE_TTL_SECONDS)
- In-flight deduplication via asyncio.Event — concurrent resolves for
  the same key share a single LLM call
- Cache stats exposed in /health endpoint (resolveCache)
- Clears on process restart (make restart / make db-reload)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI (Contributor) left a comment

Pull request overview

This PR implements an in-memory LRU cache for the resolve agent to eliminate redundant LLM calls, switches eval scoring from recall-only to F1 metric, and sets temperature=0.0 on all agents for deterministic outputs.

Changes:

  • Added LRU cache with TTL and in-flight deduplication for resolve agent results, reducing redundant LLM calls from 2-8s to <1ms for cache hits
  • Changed resolve eval scoring from recall-only to F1 to penalize both missing and spurious values, and updated test expectations to use dynamic consent code computation
  • Set temperature=0.0 on extract and resolve agents for deterministic outputs (structure agent already had this setting)

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Per-file summary:
  • backend/concept_search/resolve_agent.py — Implements LRU cache with TTL, in-flight deduplication, and stats tracking; refactors run_resolve to use caching
  • backend/tests/test_resolve_cache.py — Comprehensive test suite covering cache hits, misses, TTL, LRU eviction, normalization, deduplication, and stats
  • backend/concept_search/extract_agent.py — Sets temperature=0.0 for deterministic outputs
  • backend/concept_search/eval_resolve.py — Changes scoring to F1, adds dynamic consent code expectations via the _consent_expected helper, updates measurement/focus test expectations
  • backend/concept_search/api.py — Exposes cache stats in the /health endpoint
Comments suppressed due to low confidence (3)

backend/concept_search/resolve_agent.py:96

  • Memory leak potential: Expired cache entries are only removed when they are accessed (line 91 check) or when the cache reaches max_size (line 136-138). This means expired entries can accumulate and consume memory until they are either accessed or evicted by LRU. Consider implementing a periodic cleanup task or proactive expiration during cache operations to prevent unbounded memory growth when many unique keys are cached and then abandoned.
        async with self._lock:
            # 1. Cache hit?
            entry = self._cache.get(key)
            if entry and (time.monotonic() - entry.created) < self.ttl_seconds:
                self.hits += 1
                # Move to end for LRU ordering
                self._cache[key] = self._cache.pop(key)
                logger.info("resolve_cache hit key=%s", key)
                return entry.result

backend/concept_search/resolve_agent.py:167

  • Race condition: The stats property reads hits, misses, and len(self._cache) without holding the lock. In a multi-threaded environment, these values could be inconsistent (e.g., hits and misses could change between reads). Consider acquiring the lock when reading these values to ensure atomic reads, or use atomic operations.
    @property
    def stats(self) -> dict:
        """Return cache statistics."""
        total = self.hits + self.misses
        return {
            "hit_rate": round(self.hits / total, 3) if total else 0,
            "hits": self.hits,
            "misses": self.misses,
            "size": len(self._cache),
        }

backend/concept_search/resolve_agent.py:143

  • Error handling issue: If _run_resolve_uncached raises an exception, the result is never cached. However, the in-flight event is properly cleaned up in the finally block. Consider whether certain types of errors (e.g., transient LLM errors vs permanent validation errors) should be cached differently to avoid repeated failed attempts for the same mention.
        try:
            result = await _run_resolve_uncached(mention, index, model)
        finally:
            async with self._lock:
                ev = self._in_flight.pop(key, None)
                if ev is not None:
                    ev.set()

        # Store result
        async with self._lock:
            if len(self._cache) >= self.max_size:
                oldest = next(iter(self._cache))
                del self._cache[oldest]
            self._cache[key] = _CacheEntry(
                created=time.monotonic(), result=result
            )

        return result
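
If the team did want to avoid repeated doomed calls for permanently-bad mentions, one hedged option (not part of the PR; all names hypothetical) is a short-TTL negative cache consulted before computing:

```python
import time


class NegativeCache:
    """Sketch: remember recent failures briefly so the same bad
    mention doesn't trigger repeated doomed LLM calls."""

    def __init__(self, ttl_seconds: float = 60.0) -> None:
        self.ttl_seconds = ttl_seconds
        self._failures: dict[object, float] = {}

    def record_failure(self, key: object) -> None:
        self._failures[key] = time.monotonic()

    def recently_failed(self, key: object) -> bool:
        ts = self._failures.get(key)
        if ts is None:
            return False
        if time.monotonic() - ts >= self.ttl_seconds:
            del self._failures[key]  # entry aged out; allow a retry
            return False
        return True
```

Transient errors (timeouts, rate limits) would skip `record_failure` so they retry immediately, while validation errors get the short cooldown.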


- Extract shared LRUCache[K, V] into cache.py — used by both resolve
  and pipeline caches, eliminating duplication
- Add pipeline-level cache keyed on normalized query string — repeated
  queries skip all three agents (3.6s → 8ms)
- clear_all() clears every registered cache instance at once
- Both caches report stats in /health (pipelineCache, resolveCache)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.



Comment on lines 91 to 93
self.misses += 1
logger.info("%s miss key=%s", self.name, key)
try:
Copilot AI commented Feb 22, 2026
logger.info("%s miss key=%s", ...) has the same issue as the hit log: it records the raw cache key (including user query/mention text) at INFO level, which is both sensitive and high-volume. Consider removing it, switching to DEBUG, and/or logging only anonymized key info.
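
The anonymized-key option the reviewer mentions could look like the sketch below; the hashing scheme and helper name are assumptions, not what the PR ships:

```python
import hashlib
import logging

logger = logging.getLogger("resolve_cache")


def key_digest(key: object) -> str:
    """Short, stable digest so log lines can be correlated per key
    without exposing the user-provided text inside the cache key."""
    return hashlib.sha256(repr(key).encode("utf-8")).hexdigest()[:12]


# Hot-path logging at DEBUG level with only the digest:
key = ("condition", "type 2 diabetes")
logger.debug("resolve_cache miss key=%s", key_digest(key))
```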

Comment on lines 54 to 56
    key: K,
    compute: asyncio.coroutines,
) -> V:
Copilot AI commented Feb 22, 2026

The compute parameter type annotation is incorrect: asyncio.coroutines is not a useful callable/awaitable type here and won’t type-check. Consider typing this as an async callable (e.g., Callable[[], Awaitable[V]]) so callers and static type checkers have the right contract.
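
The annotation the reviewer suggests could be sketched as below; the cache internals are elided and the class body is illustrative:

```python
import asyncio
from typing import Awaitable, Callable, Generic, TypeVar

K = TypeVar("K")
V = TypeVar("V")


class LRUCache(Generic[K, V]):
    async def get_or_compute(
        self,
        key: K,
        # An async factory: a zero-argument callable returning an
        # awaitable that yields V. Type-checks under mypy/pyright.
        compute: Callable[[], Awaitable[V]],
    ) -> V:
        # Sketch only: always compute (a real cache checks for a hit).
        return await compute()
```

Callers then pass either an async function or a lambda wrapping a coroutine call, e.g. `cache.get_or_compute(key, lambda: fetch(key))`.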

Comment on lines +248 to +251
key = query.strip().lower()
return await pipeline_cache.get_or_compute(
    key, lambda: _run_pipeline_uncached(query, index, model)
)
Copilot AI commented Feb 22, 2026

run_pipeline caches solely by normalized query text, but the function signature allows overriding index and model. If either differs between calls, this cache can return results produced with a different model or against a different index than requested. Consider including model (and some stable index identifier/version) in the cache key, or bypass caching when overrides are provided.
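
One way to fold the overrides into the key, sketched under the assumption that the index exposes some stable version identifier (the `index_version` parameter below is hypothetical):

```python
def pipeline_cache_key(query: str, model: str, index_version: str) -> tuple:
    """Cache key that varies with model and index, so overriding
    either never returns a result computed against different inputs."""
    return (query.strip().lower(), model, index_version)


# Two calls differing only in model produce distinct keys, so a Haiku
# request can no longer be served a cached Sonnet result:
k1 = pipeline_cache_key("Diabetes", "anthropic:claude-sonnet-4", "v1")
k2 = pipeline_cache_key("diabetes", "anthropic:claude-haiku-4", "v1")
```

The alternative the reviewer offers — bypassing the cache entirely when `index` or `model` are overridden — avoids growing the key at the cost of losing caching for those calls.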

Comment on lines +281 to +284
key = (mention.facet.value, mention.text.strip().lower())
return await resolve_cache.get_or_compute(
    key, lambda: _run_resolve_uncached(mention, index, model)
)
Copilot AI commented Feb 22, 2026

run_resolve caches by (facet, normalized_text) but ignores the model argument and the index dependency. If callers override model (or pass a different ConceptIndex), this can return cached results computed with different inputs than requested. Consider incorporating model (and optionally an index version/id) into the key, or disabling caching when overrides are used.

Comment on lines 93 to 108
try:
    value = await compute()
finally:
    async with self._lock:
        ev = self._in_flight.pop(key, None)
        if ev is not None:
            ev.set()

async with self._lock:
    if len(self._cache) >= self.max_size:
        oldest = next(iter(self._cache))
        del self._cache[oldest]
    self._cache[key] = _CacheEntry(
        created=time.monotonic(), value=value
    )

Copilot AI commented Feb 22, 2026

In-flight deduplication is currently racy: the in-flight Event is set in the finally block before the computed value is inserted into _cache. Waiters can wake up, not find the entry yet, and then proceed to compute again (breaking the “run only once” guarantee and potentially causing stampedes). Consider storing the result (or exception) and updating _cache while still holding the lock, and only then setting the event (or use an asyncio.Future/Task per key to propagate result/exception).

Suggested change — replace:

    try:
        value = await compute()
    finally:
        async with self._lock:
            ev = self._in_flight.pop(key, None)
            if ev is not None:
                ev.set()

    async with self._lock:
        if len(self._cache) >= self.max_size:
            oldest = next(iter(self._cache))
            del self._cache[oldest]
        self._cache[key] = _CacheEntry(
            created=time.monotonic(), value=value
        )

with:

    success = False
    try:
        value = await compute()
        success = True
    finally:
        async with self._lock:
            if success:
                if len(self._cache) >= self.max_size:
                    oldest = next(iter(self._cache))
                    del self._cache[oldest]
                self._cache[key] = _CacheEntry(
                    created=time.monotonic(), value=value
                )
            ev = self._in_flight.pop(key, None)
            if ev is not None:
                ev.set()

Comment on lines 69 to 72
self.hits += 1
self._cache[key] = self._cache.pop(key)
logger.info("%s hit key=%s", self.name, key)
return entry.value
Copilot AI commented Feb 22, 2026

logger.info("%s hit key=%s", ...) logs the full cache key, which for pipeline_cache/resolve_cache includes user-provided query/mention text. Since logs go to CloudWatch at INFO level, this can leak potentially sensitive user input and will also be very noisy in a hot path. Consider removing this log, downgrading to DEBUG, and/or logging only non-sensitive metadata (e.g., cache name + a hash of the key).

NoopDog and others added 2 commits February 22, 2026 00:59
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix race: store result in cache before setting in-flight event, so
  waiters find the entry immediately on wakeup
- Add TTL check on waiter path to avoid returning expired entries
- Downgrade cache hit/miss logs from INFO to DEBUG (hot-path noise)
- Add exception handling tests: failed computes are not cached, and
  retries after failure work correctly

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

backend/tests/test_cache.py:97

  • The lambda closures lambda: _compute(k) in the loop capture the loop variable k by reference. While this works correctly in the current implementation because the lambda is called immediately upon cache miss (before the next iteration), this pattern is fragile and could lead to bugs if the cache implementation changes.

Consider using a default argument to capture the value: lambda k=k: _compute(k) or using a helper function that doesn't rely on closure over a loop variable. This makes the intent clearer and prevents potential future bugs.

    for k in ["a", "b", "c"]:
        await cache.get_or_compute(k, lambda: _compute(k))
    assert len(cache._cache) == 3

    await cache.get_or_compute("d", lambda: _compute("d"))
    assert len(cache._cache) == 3
    assert "a" not in cache._cache


@pytest.mark.asyncio()
async def test_lru_access_refreshes(cache: LRUCache[str, str]) -> None:
    """Accessing an entry should move it to the end (most recent)."""
    for k in ["a", "b", "c"]:
        await cache.get_or_compute(k, lambda: _compute(k))
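
The late-binding pitfall the reviewer describes, demonstrated in isolation (this snippet is illustrative, not part of the test suite):

```python
# Closures capture loop variables by reference: by the time any of
# these lambdas runs, the loop has finished and k is "c" for all three.
late = [lambda: k for k in ["a", "b", "c"]]
assert [f() for f in late] == ["c", "c", "c"]

# A default argument snapshots the value at definition time instead,
# which is the fix the reviewer suggests (lambda k=k: ...).
bound = [lambda k=k: k for k in ["a", "b", "c"]]
assert [f() for f in bound] == ["a", "b", "c"]
```

The existing tests only avoid the bug because `get_or_compute` invokes the lambda before the next loop iteration mutates `k`.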


Comment on lines +281 to +284
key = (mention.facet.value, mention.text.strip().lower())
return await resolve_cache.get_or_compute(
    key, lambda: _run_resolve_uncached(mention, index, model)
)
Copilot AI commented Feb 22, 2026

The cache key for resolve only includes (facet, normalized_text), but doesn't include the model parameter. This means that if the same mention is resolved with different models, the second call will return cached results from the first model instead of actually using the specified model.

While the API and typical usage may not vary the model parameter, the function signature accepts it, and the resolve agent is called from the pipeline which also accepts a model parameter. This could lead to unexpected behavior if different models are used for the same mention.

Consider including the model in the cache key, or documenting that the cache is model-agnostic.

Comment on lines +126 to +135
@property
def stats(self) -> dict:
    """Return cache statistics."""
    total = self.hits + self.misses
    return {
        "hit_rate": round(self.hits / total, 3) if total else 0,
        "hits": self.hits,
        "misses": self.misses,
        "size": len(self._cache),
    }
Copilot AI commented Feb 22, 2026

The stats property reads self.hits, self.misses, and len(self._cache) without acquiring self._lock. This could lead to race conditions where the stats are inconsistent or stale.

For example, if one coroutine is updating the cache while another calls stats, the hit rate calculation could be based on mismatched values of hits and misses, or the size could be inconsistent with the hit/miss counts.

Consider either:

  1. Acquiring the lock in the stats property: async with self._lock: return {...}
  2. Documenting that stats are eventually consistent and may be slightly stale
  3. Using atomic operations or a separate lock for stats if performance is a concern

Note that making this an async property would require changing all call sites to await it.

Comment on lines +248 to +250
key = query.strip().lower()
return await pipeline_cache.get_or_compute(
    key, lambda: _run_pipeline_uncached(query, index, model)
Copilot AI commented Feb 22, 2026

The cache key for the pipeline only includes the normalized query string, but doesn't include the model parameter. This means that if the same query is run with different models (e.g., via the CLI with --model), the second call will return cached results from the first model instead of actually using the specified model.

For example, if a user runs python -m concept_search.cli "diabetes" --model anthropic:claude-sonnet-4 and then runs python -m concept_search.cli "diabetes" --model anthropic:claude-haiku-4, the second call will return the cached result from Sonnet instead of re-running with Haiku.

Consider including the model in the cache key, or documenting that the cache is model-agnostic (though this seems unlikely to be the intended behavior).

@NoopDog NoopDog merged commit 604196d into main Feb 22, 2026
5 checks passed
@NoopDog NoopDog deleted the noopdog/200/resolve-cache branch February 22, 2026 09:14