
Conversation

christian-lms
Contributor

This PR does two things (for text-only models): it fixes a bunch of bugs in the KV cache implementation and brings cache reuse to MLX. The two are so intertwined that it was easiest to put them both in one PR.

On the former: this PR closes #177 and #179, and fixes one previously undocumented bug:

  • Token tracking did not respect the rotating cache's wraparound. This was silent before because we nuked the cache every time it looped, so the wraparound path never got a chance to run (sketched below).

On the latter: we spin our own variant of MLX-LM's RotatingKVCache so that we can trim it properly. Unlike the llama.cpp implementation, ours does not RoPE-shift reused chunks. (When I tried RoPE shifting, it gave gibberish responses.) This aligns more closely with what MLX-LM does under the hood (it is fine with discontinuous positional embeddings), so I'm inclined to believe this is fine, though it will need testing.

We also don't change the offset of the KV cache at all, i.e. the value that tells the model's positional embeddings where we are in the overall sequence (see the sketch below). Again, this aligns with what MLX-LM does, and will need testing to validate.

Known issues that remain as of posting include:

  • Qwen3 (and possibly other models) still has issues with reuse when thinking is disabled, because the empty <think></think> block gets counted in n_keep. This PR fixes the previous behavior (Qwen3 cache wastage #176) whereby the last generation would get nuked, but the whole cache still gets nuked when the context overflows: the miscounted n_keep retains tokens from the assistant message that it shouldn't, which prevents later sections from being reused (see the sketch after this list). This may not be a blocker, since it's not technically a regression and we know where the behavior comes from.
  • Models that spin their own cache still get the old behavior. Sometimes a custom cache is necessary (e.g. for hybrid models), but a few architectures implement their own makeCache function despite having no good reason to do so. These aren't very popular models, so we can either 1) monkeypatch this, 2) upstream a change into MLX-LM, or 3) ignore the issue altogether.

Technically the code is mergeable, but I want to do more testing before we finalize this.

cc @mattjcly @neilmehta24 @yagil

@github-actions

github-actions bot commented Jul 10, 2025

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@lmstudio-windows lmstudio-windows force-pushed the christian/cache_reuse_again branch from acf168b to 1c4cf24 on July 10, 2025 at 21:20
@github-actions github-actions bot added the CLA signed label on Jul 10, 2025
@christian-lms
Copy link
Contributor Author

Fixes are now going in #192.
