
Conversation

christian-lms
Contributor

This PR does two things (for text-only models): it fixes a bunch of bugs in the KV cache implementation and brings cache reuse to MLX. The two are so intertwined that it was easiest to put them both in one PR.

On the former: this PR closes #177 and #179, and fixes one previously undocumented bug:

  • Token tracking did not respect the rotating cache's wraparound. This was silent before because we nuked the cache every time it looped, so the wraparound path never got a chance to run (sketched below).

On the latter: we spin our own variant of MLX-LM's RotatingKVCache so that we can trim it properly. Unlike the llama.cpp implementation, ours does not RoPE-shift reused chunks. (When I tried RoPE shifting, it gave gibberish responses.) This aligns more closely with what MLX-LM does under the hood (it is fine with discontinuous positional embeddings), so I'm inclined to believe this is fine, though it will need testing.

We also don't change the offset of the KV cache at all, i.e. the value that tells the model's positional embeddings where we are in the overall sequence (see the sketch below). Again, this aligns with what MLX-LM does, and will need testing to validate.

Known issues that remain as of posting include:

  • Qwen3 (and possibly other models) still has issues with reuse when thinking is disabled, because the empty <think></think> block gets counted in n_keep. This PR fixes the previous behavior (Qwen3 cache wastage #176) whereby the last generation would get nuked, but the whole cache still gets nuked when the context overflows: the miscounted n_keep retains tokens from the assistant message that it shouldn't, which prevents later sections from being reused (see the sketch after this list). This may not be a blocker, since it's not technically a regression and we know where the behavior comes from.
  • Models that spin their own cache still get the old behavior. Sometimes a custom cache is necessary (e.g. for hybrid models), but a few architectures implement their own makeCache function despite having no good reason to do so. These aren't very popular models, so we can either 1) monkeypatch this, 2) upstream a change into MLX-LM, or 3) ignore the issue altogether.

Technically the code is mergeable, but I want to do more testing before we finalize this.

cc @mattjcly @neilmehta24 @yagil

@github-actions

github-actions bot commented Jul 10, 2025

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@lmstudio-windows lmstudio-windows force-pushed the christian/cache_reuse_again branch from acf168b to 1c4cf24 on July 10, 2025 at 21:20
@github-actions github-actions bot added the CLA signed label on Jul 10, 2025
@christian-lms
Copy link
Contributor Author

Fixes are now going in #192.
