Conversation
The resolve eval was using recall-only scoring, which didn't penalize spurious extra values. Switch to F1 (harmonic mean of precision and recall) so both missing and extra values are caught. Update expected outputs to match the agent's correct behavior:

- Consent code cases now use dynamic expectations computed via `compute_eligible_codes`, so they auto-update with catalog data
- Measurement/focus cases updated to include the full set of related concepts the agent returns (e.g. "blood pressure" → Systolic + Diastolic)
- Remove duplicate eval cases (same input, different expectations)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
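The F1 metric the commit switches to can be sketched over value sets (the function name is illustrative, not the eval's actual helper):

```python
def f1_score(expected: set[str], actual: set[str]) -> float:
    """Harmonic mean of precision and recall over two value sets.

    Recall-only scoring ignores spurious extras; F1 penalizes both
    missing values (low recall) and extra values (low precision).
    """
    if not expected and not actual:
        return 1.0
    tp = len(expected & actual)  # true positives
    if tp == 0:
        return 0.0
    precision = tp / len(actual)
    recall = tp / len(expected)
    return 2 * precision * recall / (precision + recall)
```

Under recall-only scoring, `{"a", "b", "c"}` against an expected `{"a", "b"}` would score a perfect 1.0; F1 drops it to 0.8 because the spurious `"c"` hurts precision.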
Deterministic outputs are needed for the resolve cache (same input → same result). Extract agent had no explicit temperature; resolve agent was at 0.2. Both now pinned to 0.0. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cache resolved mentions by (facet, normalized_text) to eliminate redundant LLM calls for repeated queries. Features:

- LRU eviction at 10k entries (configurable via `RESOLVE_CACHE_MAX_SIZE`)
- 24h TTL (configurable via `RESOLVE_CACHE_TTL_SECONDS`)
- In-flight deduplication via `asyncio.Event` — concurrent resolves for the same key share a single LLM call
- Cache stats exposed in `/health` endpoint (`resolveCache`)
- Clears on process restart (`make restart` / `make db-reload`)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
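The LRU + TTL behavior described above can be sketched minimally (synchronous, without the locking, stats, and in-flight deduplication; class and method names are illustrative, not the PR's API):

```python
import time
from collections import OrderedDict


class TTLLRUCache:
    """Minimal sketch of an LRU cache with TTL, keyed by e.g.
    (facet, normalized_text). Not the PR's actual implementation."""

    def __init__(self, max_size: int = 10_000, ttl_seconds: float = 86_400.0):
        self.max_size = max_size
        self.ttl_seconds = ttl_seconds
        self._cache: OrderedDict = OrderedDict()  # key -> (created, value)

    def get(self, key):
        entry = self._cache.get(key)
        if entry is None:
            return None
        created, value = entry
        if time.monotonic() - created >= self.ttl_seconds:
            del self._cache[key]  # expired: drop and treat as a miss
            return None
        self._cache.move_to_end(key)  # refresh LRU position on hit
        return value

    def put(self, key, value):
        if key in self._cache:
            self._cache.move_to_end(key)
        elif len(self._cache) >= self.max_size:
            self._cache.popitem(last=False)  # evict least-recently-used
        self._cache[key] = (time.monotonic(), value)
```

Because hits call `move_to_end`, a recently read key survives eviction while an untouched one is dropped first.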
Pull request overview
This PR implements an in-memory LRU cache for the resolve agent to eliminate redundant LLM calls, switches eval scoring from recall-only to F1 metric, and sets temperature=0.0 on all agents for deterministic outputs.
Changes:
- Added LRU cache with TTL and in-flight deduplication for resolve agent results, reducing redundant LLM calls from 2-8s to <1ms for cache hits
- Changed resolve eval scoring from recall-only to F1 to penalize both missing and spurious values, and updated test expectations to use dynamic consent code computation
- Set temperature=0.0 on extract and resolve agents for deterministic outputs (structure agent already had this setting)
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| backend/concept_search/resolve_agent.py | Implements LRU cache with TTL, in-flight deduplication, stats tracking, and refactors run_resolve to use caching |
| backend/tests/test_resolve_cache.py | Comprehensive test suite covering cache hits, misses, TTL, LRU eviction, normalization, deduplication, and stats |
| backend/concept_search/extract_agent.py | Sets temperature=0.0 for deterministic outputs |
| backend/concept_search/eval_resolve.py | Changes scoring to F1, adds dynamic consent code expectations via _consent_expected helper, updates measurement/focus test expectations |
| backend/concept_search/api.py | Exposes cache stats in /health endpoint |
Comments suppressed due to low confidence (3)
backend/concept_search/resolve_agent.py:96
- Memory leak potential: Expired cache entries are only removed when they are accessed (line 91 check) or when the cache reaches max_size (line 136-138). This means expired entries can accumulate and consume memory until they are either accessed or evicted by LRU. Consider implementing a periodic cleanup task or proactive expiration during cache operations to prevent unbounded memory growth when many unique keys are cached and then abandoned.
```python
async with self._lock:
    # 1. Cache hit?
    entry = self._cache.get(key)
    if entry and (time.monotonic() - entry.created) < self.ttl_seconds:
        self.hits += 1
        # Move to end for LRU ordering
        self._cache[key] = self._cache.pop(key)
        logger.info("resolve_cache hit key=%s", key)
        return entry.result
```
backend/concept_search/resolve_agent.py:167
- Race condition: The stats property reads hits, misses, and len(self._cache) without holding the lock. In a multi-threaded environment, these values could be inconsistent (e.g., hits and misses could change between reads). Consider acquiring the lock when reading these values to ensure atomic reads, or use atomic operations.
```python
@property
def stats(self) -> dict:
    """Return cache statistics."""
    total = self.hits + self.misses
    return {
        "hit_rate": round(self.hits / total, 3) if total else 0,
        "hits": self.hits,
        "misses": self.misses,
        "size": len(self._cache),
    }
```
backend/concept_search/resolve_agent.py:143
- Error handling issue: If _run_resolve_uncached raises an exception, the result is never cached. However, the in-flight event is properly cleaned up in the finally block. Consider whether certain types of errors (e.g., transient LLM errors vs permanent validation errors) should be cached differently to avoid repeated failed attempts for the same mention.
```python
try:
    result = await _run_resolve_uncached(mention, index, model)
finally:
    async with self._lock:
        ev = self._in_flight.pop(key, None)
        if ev is not None:
            ev.set()
# Store result
async with self._lock:
    if len(self._cache) >= self.max_size:
        oldest = next(iter(self._cache))
        del self._cache[oldest]
    self._cache[key] = _CacheEntry(
        created=time.monotonic(), result=result
    )
return result
```
- Extract shared `LRUCache[K, V]` into `cache.py` — used by both resolve and pipeline caches, eliminating duplication
- Add pipeline-level cache keyed on normalized query string — repeated queries skip all three agents (3.6s → 8ms)
- `clear_all()` clears every registered cache instance at once
- Both caches report stats in `/health` (`pipelineCache`, `resolveCache`)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
```python
self.misses += 1
logger.info("%s miss key=%s", self.name, key)
try:
```
logger.info("%s miss key=%s", ...) has the same issue as the hit log: it records the raw cache key (including user query/mention text) at INFO level, which is both sensitive and high-volume. Consider removing it, switching to DEBUG, and/or logging only anonymized key info.
```python
key: K,
compute: asyncio.coroutines,
) -> V:
```
The compute parameter type annotation is incorrect: asyncio.coroutines is not a useful callable/awaitable type here and won’t type-check. Consider typing this as an async callable (e.g., Callable[[], Awaitable[V]]) so callers and static type checkers have the right contract.
```python
key = query.strip().lower()
return await pipeline_cache.get_or_compute(
    key, lambda: _run_pipeline_uncached(query, index, model)
)
```
run_pipeline caches solely by normalized query text, but the function signature allows overriding index and model. If either differs between calls, this cache can return results produced with a different model or against a different index than requested. Consider including model (and some stable index identifier/version) in the cache key, or bypass caching when overrides are provided.
```python
key = (mention.facet.value, mention.text.strip().lower())
return await resolve_cache.get_or_compute(
    key, lambda: _run_resolve_uncached(mention, index, model)
)
```
run_resolve caches by (facet, normalized_text) but ignores the model argument and the index dependency. If callers override model (or pass a different ConceptIndex), this can return cached results computed with different inputs than requested. Consider incorporating model (and optionally an index version/id) into the key, or disabling caching when overrides are used.
backend/concept_search/cache.py (outdated)
```python
try:
    value = await compute()
finally:
    async with self._lock:
        ev = self._in_flight.pop(key, None)
        if ev is not None:
            ev.set()

async with self._lock:
    if len(self._cache) >= self.max_size:
        oldest = next(iter(self._cache))
        del self._cache[oldest]
    self._cache[key] = _CacheEntry(
        created=time.monotonic(), value=value
    )
```
In-flight deduplication is currently racy: the in-flight Event is set in the finally block before the computed value is inserted into _cache. Waiters can wake up, not find the entry yet, and then proceed to compute again (breaking the “run only once” guarantee and potentially causing stampedes). Consider storing the result (or exception) and updating _cache while still holding the lock, and only then setting the event (or use an asyncio.Future/Task per key to propagate result/exception).
Suggested change:

```diff
-try:
-    value = await compute()
-finally:
-    async with self._lock:
-        ev = self._in_flight.pop(key, None)
-        if ev is not None:
-            ev.set()
-async with self._lock:
-    if len(self._cache) >= self.max_size:
-        oldest = next(iter(self._cache))
-        del self._cache[oldest]
-    self._cache[key] = _CacheEntry(
-        created=time.monotonic(), value=value
-    )
+success = False
+try:
+    value = await compute()
+    success = True
+finally:
+    async with self._lock:
+        if success:
+            if len(self._cache) >= self.max_size:
+                oldest = next(iter(self._cache))
+                del self._cache[oldest]
+            self._cache[key] = _CacheEntry(
+                created=time.monotonic(), value=value
+            )
+        ev = self._in_flight.pop(key, None)
+        if ev is not None:
+            ev.set()
```
```python
self.hits += 1
self._cache[key] = self._cache.pop(key)
logger.info("%s hit key=%s", self.name, key)
return entry.value
```
logger.info("%s hit key=%s", ...) logs the full cache key, which for pipeline_cache/resolve_cache includes user-provided query/mention text. Since logs go to CloudWatch at INFO level, this can leak potentially sensitive user input and will also be very noisy in a hot path. Consider removing this log, downgrading to DEBUG, and/or logging only non-sensitive metadata (e.g., cache name + a hash of the key).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Fix race: store result in cache before setting in-flight event, so waiters find the entry immediately on wakeup
- Add TTL check on waiter path to avoid returning expired entries
- Downgrade cache hit/miss logs from INFO to DEBUG (hot-path noise)
- Add exception handling tests: failed computes are not cached, and retries after failure work correctly

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
backend/tests/test_cache.py:97
- The lambda closures `lambda: _compute(k)` in the loop capture the loop variable `k` by reference. While this works correctly in the current implementation, because the lambda is called immediately on a cache miss (before the next iteration), the pattern is fragile and could lead to bugs if the cache implementation changes.
Consider using a default argument to capture the value, `lambda k=k: _compute(k)`, or a helper function that doesn't rely on closing over the loop variable. This makes the intent clearer and prevents potential future bugs.
```python
for k in ["a", "b", "c"]:
    await cache.get_or_compute(k, lambda: _compute(k))
assert len(cache._cache) == 3
await cache.get_or_compute("d", lambda: _compute("d"))
assert len(cache._cache) == 3
assert "a" not in cache._cache


@pytest.mark.asyncio()
async def test_lru_access_refreshes(cache: LRUCache[str, str]) -> None:
    """Accessing an entry should move it to the end (most recent)."""
    for k in ["a", "b", "c"]:
        await cache.get_or_compute(k, lambda: _compute(k))
```
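The late-binding pitfall and the default-argument fix can be demonstrated in isolation (a hypothetical demo, not part of the PR's test suite):

```python
def late_binding_demo() -> tuple[list[str], list[str]]:
    """All deferred by-reference lambdas see the loop variable's final
    value; a default argument freezes each iteration's value instead."""
    by_ref = [lambda: k for k in ["a", "b", "c"]]       # late binding
    by_val = [lambda k=k: k for k in ["a", "b", "c"]]   # captured per iteration
    return [f() for f in by_ref], [f() for f in by_val]
```

The danger only materializes when the lambda outlives the iteration, which is exactly what would happen if the cache ever deferred `compute` (e.g. to a background task) instead of awaiting it immediately.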
```python
key = (mention.facet.value, mention.text.strip().lower())
return await resolve_cache.get_or_compute(
    key, lambda: _run_resolve_uncached(mention, index, model)
)
```
The cache key for resolve only includes (facet, normalized_text), but doesn't include the model parameter. This means that if the same mention is resolved with different models, the second call will return cached results from the first model instead of actually using the specified model.
While the API and typical usage may not vary the model parameter, the function signature accepts it, and the resolve agent is called from the pipeline which also accepts a model parameter. This could lead to unexpected behavior if different models are used for the same mention.
Consider including the model in the cache key, or documenting that the cache is model-agnostic.
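A hypothetical key builder along the suggested lines (the names and the `index_version` parameter are illustrative, not the PR's API):

```python
def resolve_cache_key(
    facet: str, text: str, model: str, index_version: str
) -> tuple[str, str, str, str]:
    """Build a cache key that includes every input the resolution depends
    on: facet, normalized mention text, the model identifier, and a stable
    index version, so overrides never hit entries computed differently."""
    return (facet, text.strip().lower(), model, index_version)
```

With the model and index version folded in, two calls that differ only in `model` produce distinct keys and therefore distinct cache entries.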
```python
@property
def stats(self) -> dict:
    """Return cache statistics."""
    total = self.hits + self.misses
    return {
        "hit_rate": round(self.hits / total, 3) if total else 0,
        "hits": self.hits,
        "misses": self.misses,
        "size": len(self._cache),
    }
```
The stats property reads self.hits, self.misses, and len(self._cache) without acquiring self._lock. This could lead to race conditions where the stats are inconsistent or stale.
For example, if one coroutine is updating the cache while another calls stats, the hit rate calculation could be based on mismatched values of hits and misses, or the size could be inconsistent with the hit/miss counts.
Consider either:
- Acquiring the lock in the stats property: `async with self._lock: return {...}`
- Documenting that stats are eventually consistent and may be slightly stale
- Using atomic operations or a separate lock for stats if performance is a concern

Note that making this an async property would require changing all call sites to await it.
```python
key = query.strip().lower()
return await pipeline_cache.get_or_compute(
    key, lambda: _run_pipeline_uncached(query, index, model)
```
The cache key for the pipeline only includes the normalized query string, but doesn't include the model parameter. This means that if the same query is run with different models (e.g., via the CLI with --model), the second call will return cached results from the first model instead of actually using the specified model.
For example, if a user runs `python -m concept_search.cli "diabetes" --model anthropic:claude-sonnet-4` and then runs `python -m concept_search.cli "diabetes" --model anthropic:claude-haiku-4`, the second call will return the cached result from Sonnet instead of re-running with Haiku.
Consider including the model in the cache key, or documenting that the cache is model-agnostic (though this seems unlikely to be the intended behavior).
Summary

- Added an in-memory LRU cache for the resolve agent so a repeated `(facet, text)` pair is resolved only once. Cache hits return in <1ms vs 2-8s for an LLM call.
- Switched resolve eval scoring from recall-only to F1, and replaced hardcoded consent-code expectations with a `_consent_expected()` helper that stays in sync with catalog data. Fixed measurement/focus expectations from actual agent output.

Cache details

- LRU eviction at 10k entries (configurable via `RESOLVE_CACHE_MAX_SIZE`)
- 24h TTL (configurable via `RESOLVE_CACHE_TTL_SECONDS`)
- In-flight deduplication via `asyncio.Event`
- Cache stats exposed in the `/health` endpoint (`resolveCache: {hits, misses, hit_rate, size}`)
- Clears on process restart (`make restart` / `make db-reload`)

Test plan
🤖 Generated with Claude Code