MLX RotatingKVCache trim behavior causes context overflow policies to always erase the whole cache #177

@christian-lms

We defer KV cache creation to mlx_lm.models.cache.make_prompt_cache, which itself defers to the model for implementation. Most models (with some exceptions, e.g. the DeepSeeks) don't implement their own KV caches, so MLX defaults to an mlx_lm.models.cache.RotatingKVCache that uses a circular buffer to manage arbitrary-length generation. This avoids the complexity of manually shifting a fixed-size linear cache (cf. llama.cpp), but introduces problems of its own: namely, once we have generated n tokens where n > max_kv_size, the cache no longer lets us trim from the end of it.
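A minimal repro sketch of that behavior, assuming a recent mlx_lm where RotatingKVCache takes max_size directly and infers tensor shapes on first update (the tiny max_size and dummy shapes here are purely for illustration):

```python
import mlx.core as mx
from mlx_lm.models.cache import (
    RotatingKVCache,
    can_trim_prompt_cache,
    trim_prompt_cache,
)

# One layer's cache with a tiny capacity so the buffer wraps quickly.
cache = [RotatingKVCache(max_size=8)]

# Feed 12 single-token updates, exceeding max_size.
# Shapes are dummies: (batch, n_kv_heads, seq_len, head_dim).
for _ in range(12):
    k = mx.zeros((1, 1, 1, 4))
    v = mx.zeros((1, 1, 1, 4))
    cache[0].update_and_fetch(k, v)

print(cache[0].offset)               # 12: tokens seen so far
print(can_trim_prompt_cache(cache))  # False once offset >= max_size
print(trim_prompt_cache(cache, 4))   # 0: the trim request is rejected
```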

This is a problem because we want to trim from the cache exactly when we have generated more than max_kv_size tokens; that is the whole point of a context overflow policy! What happens in practice is that mlx_engine.cache_wrapper._get_unprocessed_tokens attempts to trim the cache in accordance with the context overflow policy, but the cache rejects the trim request on account of being over capacity. The trim then falls back to erasing the entire cache, which usually forces thousands to tens of thousands of tokens to be unnecessarily recomputed at great expense. This is not ideal.
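The failing path can be pictured with a hypothetical sketch of the wrapper's logic; function structure and variable names are illustrative, not mlx_engine.cache_wrapper's actual code:

```python
from mlx_lm.models.cache import can_trim_prompt_cache, trim_prompt_cache

def _get_unprocessed_tokens(cache, cached_tokens, prompt_tokens):
    # Length of the longest prefix shared by the cached tokens and the
    # incoming prompt, i.e. how much of the cache is still valid.
    common = 0
    limit = min(len(cached_tokens), len(prompt_tokens))
    while common < limit and cached_tokens[common] == prompt_tokens[common]:
        common += 1

    num_to_trim = len(cached_tokens) - common
    if num_to_trim > 0:
        if can_trim_prompt_cache(cache):
            trim_prompt_cache(cache, num_to_trim)
        else:
            # The RotatingKVCache is over capacity, so it refuses the trim.
            # The only remaining option is to throw the cache away and
            # re-prefill the full prompt: the expensive path described above.
            cache.clear()  # illustrative stand-in for rebuilding the cache
            common = 0

    # Tokens the model still needs to process.
    return prompt_tokens[common:]
```

Because a context overflow policy only ever trims after the cache has exceeded max_kv_size, the else branch is taken every time, so the policy degenerates to a full re-prefill.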
