- 
                Notifications
    You must be signed in to change notification settings 
- Fork 70
Initial cache fixes #192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Initial cache fixes #192
Conversation
| Could we add a test to https://github.com/lmstudio-ai/mlx-engine/blob/31e9b6d38db2417784687e3b12b90275f80828f8/tests/test_text_models.py that confirms the new trimming behavior fixes the "invalidate cache fully when generation overflows context" case? I.e., make sure that from an e2e model running perspective, this behavior is as expected? | 
| self.max_kv_size is not None | ||
| and self.is_rotating | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there cases where self.max_kv_size is not None and it's not a rotating cache? Probably when a model has a custom kv cache?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No: almost all cases where a model has a custom KV cache are recurrent/hybrid models where they have to use MambaKVCaches for certain layers. There are a few architectures that override to force certain other combinations of caches, but all of them are combinations of the MLX default caches, and no model implements its own custom KV cache. So this is comprehensive. We discussed how to handle this and it seems like this is the best option since the cache itself doesn't expose an API for this.
Co-authored-by: Matt Clayton <[email protected]>
| Good to merge after engine release with support for LFM2 goes out | 
| Closing b/c stale. We can bring this back as desired. | 
Closes #177 and #179. These are the bugfixes I was going to introduce in #188 but the reuse part ended up taking much longer than I thought it would and so these fixes are going in first.
The major thing this PR does is allow us to properly trim from the KV cache (#177). This, however, requires joint fixes to
keep(#179) and token tracking (no open issue) because those cause bugs that were previously silent: since we previously nuked the cache before the cache could rotate, bugs caused by actually rotating the cache didn't show up until now.cc @mattjcly