Description
We defer KV cache creation to `mlx_lm.models.cache.make_prompt_cache`, which itself defers to the model for implementation. Most models (with some exceptions, e.g. the DeepSeeks) don't implement their own KV caches, so MLX defaults to a `mlx_lm.models.cache.RotatingKVCache` that uses a circular buffer to manage arbitrary-length generation. This avoids the complexities of manually shifting a fixed-size linear cache (cf. llama.cpp), but introduces problems of its own: namely, once we have generated `n` tokens where `n > max_kv_size`, the cache will no longer let us trim from the end of it.
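
For reference, the cache selection in `mlx_lm` looks roughly like the sketch below (paraphrased from `mlx_lm.models.cache`; exact signatures and defaults vary by version):

```python
# Sketch of mlx_lm's cache selection logic, not the verbatim implementation.
from mlx_lm.models.cache import KVCache, RotatingKVCache

def make_prompt_cache_sketch(model, max_kv_size=None):
    # Models that implement their own cache (e.g. the DeepSeeks) take priority.
    if hasattr(model, "make_cache"):
        return model.make_cache()
    num_layers = len(model.layers)
    if max_kv_size is not None:
        # Circular buffer: bounded memory for arbitrary-length generation,
        # but trimming is refused once the buffer wraps (see below).
        return [RotatingKVCache(max_size=max_kv_size) for _ in range(num_layers)]
    return [KVCache() for _ in range(num_layers)]
```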
This is a problem because we want to trim from the cache exactly when we have generated more than `max_kv_size` tokens; that is the whole point of a context overflow policy! What happens in practice is that `mlx_engine.cache_wrapper._get_unprocessed_tokens` attempts to trim the cache in accordance with the context overflow policy, but fails because the cache rejects the trim request on account of being over capacity. The trim then falls back to erasing the entire cache, which typically forces thousands to tens of thousands of tokens to be unnecessarily recomputed at great expense. This is not ideal.