Description
We defer KV cache creation to `mlx_lm.models.cache.make_prompt_cache`, which itself defers to the model for implementation. Most models (with some exceptions, e.g. the DeepSeeks) don't implement their own KV caches, so MLX defaults to a `mlx_lm.models.cache.RotatingKVCache` that uses a circular buffer to manage arbitrary-length generation. This avoids the complexities of manually shifting a fixed-size linear cache (cf. llama.cpp), but introduces problems of its own: namely, once we have generated `n` tokens where `n > max_kv_size`, the cache will no longer let us trim from the end of it.
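
For reference, the cache selection in `mlx_lm` looks roughly like the sketch below (paraphrased from `mlx_lm.models.cache`; exact signatures and defaults vary by version):

```python
# Sketch of mlx_lm's cache selection logic, not the verbatim implementation.
from mlx_lm.models.cache import KVCache, RotatingKVCache

def make_prompt_cache_sketch(model, max_kv_size=None):
    # Models that implement their own cache (e.g. the DeepSeeks) take priority.
    if hasattr(model, "make_cache"):
        return model.make_cache()
    num_layers = len(model.layers)
    if max_kv_size is not None:
        # Circular buffer: bounded memory for arbitrary-length generation,
        # but trimming is refused once the buffer wraps (see below).
        return [RotatingKVCache(max_size=max_kv_size) for _ in range(num_layers)]
    return [KVCache() for _ in range(num_layers)]
```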
This is a problem because we want to trim from the cache exactly when we have generated more than `max_kv_size` tokens; that is the whole point of a context overflow policy! What happens in practice is that `mlx_engine.cache_wrapper._get_unprocessed_tokens` attempts to trim the cache in accordance with the context overflow policy, but fails because the cache rejects the trim request on account of being over capacity. The trim then falls back to erasing the entire cache, which typically forces thousands to tens of thousands of tokens to be unnecessarily recomputed at great expense. This is not ideal.