Add SQLite cache backend (193x faster than file-based at scale)
Replace JSON bin-file approach with per-model SQLite databases.
Activated via SQLITE_CACHE=true env var or use_sqlite=True in get_cache_manager().
Key improvements:
- O(1) lookup by primary key (no loading entire 28MB bin files)
- WAL mode for concurrent readers without blocking
- Connection pooling (reuse across calls)
- zstd compression (~559MB JSON → 3MB SQLite)
- Schema versioning (stale entries = clean cache miss)
- Batch lookups via SQL IN clause
- Built-in hit/miss/cost statistics
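To illustrate how the pieces in the list above fit together, here is a minimal sketch of a SQLite-backed cache with WAL mode, primary-key lookup, compression, schema versioning, and hit/miss counters. This is not the actual implementation: the table shape, the `SCHEMA_VERSION` constant, and the class name are hypothetical, and stdlib `zlib` stands in for the zstd compression the commit uses.

```python
import json
import sqlite3
import zlib  # stdlib stand-in for the zstd compression described in the commit

SCHEMA_VERSION = 1  # hypothetical: rows with a stale version read as clean misses


class SqliteCache:
    """Minimal sketch of a per-model SQLite cache (illustrative, not the real code)."""

    def __init__(self, path: str = ":memory:"):
        self.conn = sqlite3.connect(path)
        # WAL lets concurrent readers proceed while a writer appends.
        self.conn.execute("PRAGMA journal_mode=WAL")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cache ("
            " key TEXT PRIMARY KEY,"
            " schema_version INTEGER,"
            " payload BLOB)"
        )
        self.hits = 0
        self.misses = 0

    def get(self, key: str):
        # Single B-tree probe by primary key; no bin file is ever loaded.
        row = self.conn.execute(
            "SELECT schema_version, payload FROM cache WHERE key = ?", (key,)
        ).fetchone()
        if row is None or row[0] != SCHEMA_VERSION:
            self.misses += 1  # stale schema counts as a miss, not an error
            return None
        self.hits += 1
        return json.loads(zlib.decompress(row[1]))

    def put(self, key: str, value) -> None:
        blob = zlib.compress(json.dumps(value).encode())
        self.conn.execute(
            "INSERT OR REPLACE INTO cache VALUES (?, ?, ?)",
            (key, SCHEMA_VERSION, blob),
        )
        self.conn.commit()
```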
Benchmark (3500 entries, 10k lookups with 65% miss rate, 28MB/bin):
File-based: 484.9s (event loop frozen 8 min)
SQLite: 2.5s (193x faster)
The pathology: FileBasedCacheManager reloads the ENTIRE bin from disk
on every cache miss (to check if another process wrote the entry).
With 6500 misses × 28MB bins = 182GB of JSON parsing serialized on
the event loop. SQLite misses are a single B-tree lookup returning NULL.
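The batch-lookup path mentioned above can be sketched with a parameterized IN clause, assuming a simple `(key, payload)` table; absent keys are simply missing from the result dict rather than triggering any reload:

```python
import sqlite3


def get_many(conn: sqlite3.Connection, keys: list[str]) -> dict[str, bytes]:
    """Fetch many cache rows in one round trip; absent keys are misses."""
    if not keys:
        return {}
    placeholders = ",".join("?" * len(keys))
    rows = conn.execute(
        f"SELECT key, payload FROM cache WHERE key IN ({placeholders})",
        keys,
    ).fetchall()
    return dict(rows)


# Demo against an in-memory table with the assumed (key, payload) shape.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cache (key TEXT PRIMARY KEY, payload BLOB)")
conn.executemany("INSERT INTO cache VALUES (?, ?)", [("a", b"1"), ("b", b"2")])
found = get_many(conn, ["a", "b", "zzz"])  # "zzz" is a miss
```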
Also fixes pre-existing pyright errors in cache_manager.py (nullable
responses field on LLMCache, redis type annotations).
             LOGGER.warning("Cache does not contain completion; likely due to recitation")
         else:
             LOGGER.warning(
                 f"Proportion of cache responses that contain empty completions ({prop_empty_completions}) is greater than threshold {empty_completion_threshold}. Likely due to recitation"
             )
-        failed_cache_response = cached_result.responses
+        failed_cache_response = responses_list
         cached_result = None
         cached_response = None
     else:
-        cached_response = (
-            cached_result.responses
-        )  # We want a list of LLMResponses if we have n responses in a cache
+        cached_response = responses_list
         if insufficient_valids_behaviour != "continue":
-            assert (
-                len(cached_result.responses) == n
-            ), f"cache is inconsistent with n={n}\n{cached_result.responses}"
+            assert len(responses_list) == n, f"cache is inconsistent with n={n}\n{responses_list}"
 ), f"There should be the same number of responses and failed_cache_responses! Instead we have {len(responses)} responses and {len(failed_cache_responses)} failed_cache_responses."
             LOGGER.warning("Cache does not contain completion; likely due to recitation")
         else:
             LOGGER.warning(
                 f"Proportion of cache responses that contain empty completions ({prop_empty_completions}) is greater than threshold {empty_completion_threshold}. Likely due to recitation"
             )
-        failed_cache_response = cached_result.responses
+        failed_cache_response = responses_list
         cached_result = None
         cached_response = None
     else:
-        cached_response = cached_result.responses
+        cached_response = responses_list
         if insufficient_valids_behaviour != "continue":
-            assert (
-                len(cached_result.responses) == n
-            ), f"cache is inconsistent with n={n}\n{cached_result.responses}"
+            assert len(responses_list) == n, f"cache is inconsistent with n={n}\n{responses_list}"
 ), f"There should be the same number of responses and failed_cache_responses! Instead we have {len(responses)} responses and {len(failed_cache_responses)} failed_cache_responses."