Options for supporting multiple users wrt. slots, KV-cache, --prompt-cache, --slot-save #681
usrlocalben started this conversation in General
Replies: 1 comment · 2 replies
What do you mean, "--slot-save doesn't seem to do anything"? I find it really useful, and am glad I made it work with MLA models as well (see #497), and I have a UI for managing it with mikupad (still a WIP, but screenshots are in #558 (comment)). I do agree that KV state management has a lot of potential upgrades; I had being able to automatically use any saves in the
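For reference, slot saving is driven over HTTP rather than happening automatically: the server is started with a `--slot-save-path` directory, and saves/restores are triggered via `POST /slots/{id}?action=save|restore|erase`, following the llama.cpp server API. A minimal sketch of driving that endpoint (the base URL, slot id, and filename here are illustrative assumptions):

```python
# Sketch of driving the server's slot save/restore endpoint.
# Assumes a server launched with --slot-save-path; the endpoint shape
# (POST /slots/{id}?action=...) follows the llama.cpp server API.
import json
import urllib.request


def slot_action_request(base_url: str, slot_id: int, action: str,
                        filename: str) -> urllib.request.Request:
    """Build the POST request for /slots/{id}?action=save|restore|erase."""
    url = f"{base_url}/slots/{slot_id}?action={action}"
    body = json.dumps({"filename": filename}).encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Example: save slot 0's KV state to a file under --slot-save-path,
# so it can be restored later instead of re-prefilling a deep context.
req = slot_action_request("http://localhost:8080", 0, "save", "deep_ctx.bin")
# urllib.request.urlopen(req)  # uncomment against a live server
```

Restoring is the same call with `action=restore` and the same filename, which is what makes swapping users' contexts in and out of a single slot practical.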
- --prompt-cache doesn't seem to do anything, despite there being an implementation that reads the flag and populates the value.
- --slot-save doesn't seem to do anything either, and after reading the /slots/ handlers I'm not even sure how it's intended to be used.
- With -np 1, two users clobber each other's KV cache, even if the users don't mind waiting for a request to finish (e.g. tens of minutes to rebuild a deep-context slot).
- Can the KV cache be split slot-wise across CUDA devices?
- Currently it seems like the only way to get two non-clobbering slots is to avoid the GPU entirely and go CPU-only, where context space is effectively "unlimited".
- What's the state of affairs wrt. these params, and what options are there?
- Could an LRU cache of slots be implemented? CPU backends don't offer any meaningful concurrency, but swapping slots in and out would go a long way.
- Have I missed something? 😅
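The LRU-cache-of-slots idea in the list above could be sketched roughly like this (a toy illustration, not actual server code; the keying by user and the "evict to disk" step are assumptions about how it might work): keep N KV states hot, and when a new user arrives with all slots busy, swap out the least recently used state instead of clobbering it.

```python
# Toy sketch of an LRU cache of server slots: at most `capacity` KV
# states stay "hot"; the least recently used one is swapped out on a
# miss (here just recorded; a real server would serialize it to disk,
# e.g. via the slot-save mechanism, and restore it on a later hit).
from collections import OrderedDict


class SlotLRU:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.slots = OrderedDict()  # key (e.g. user/prefix) -> KV state handle
        self.evicted = []           # what a real server would save to disk

    def acquire(self, key, build_state):
        if key in self.slots:                  # hit: reuse the warm KV state
            self.slots.move_to_end(key)
            return self.slots[key]
        if len(self.slots) >= self.capacity:   # full: evict the LRU slot
            old_key, old_state = self.slots.popitem(last=False)
            self.evicted.append((old_key, old_state))
        state = build_state(key)               # expensive: rebuild deep context
        self.slots[key] = state
        return state


# Two hot slots, three users: "bob" is the LRU entry when "carol" arrives.
cache = SlotLRU(capacity=2)
cache.acquire("alice", lambda k: f"kv:{k}")
cache.acquire("bob", lambda k: f"kv:{k}")
cache.acquire("alice", lambda k: f"kv:{k}")   # refresh alice's recency
cache.acquire("carol", lambda k: f"kv:{k}")   # evicts bob's state
```

Even without GPU concurrency, this kind of policy would avoid the worst case in the list above: a returning user's deep context is restored from a save instead of being re-prefilled from scratch.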