Options for supporting multiple users wrt. slots, KV-cache, --prompt-cache, --slot-save #681
What do you mean by "--slot-save doesn't seem to do anything"? I find it really useful, and I'm glad I made it work with MLA models as well (see #497); I also have a UI for managing it in mikupad (still a WIP, but screenshots are here: #558 (comment)). I do agree that KV state management has a lot of potential upgrades; one thing I had in mind was being able to automatically use any saves in the …
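For readers wondering how slot saving is driven at all, here is a minimal sketch. It assumes this fork exposes the same slot endpoints as upstream llama.cpp's server (`POST /slots/{id}?action=save|restore`, enabled by starting the server with a slot-save path, `--slot-save-path` upstream, which may be what `--slot-save` refers to here); the host, port, and filenames are placeholders, not anything from this repo.

```python
# Sketch: save and restore one slot's KV-cache over the server HTTP API.
# Assumes upstream llama.cpp-style endpoints POST /slots/{id}?action=save|restore
# exist in this fork and that the server was started with a slot-save path
# (e.g. --slot-save-path /tmp/slot_saves). Host and filenames are hypothetical.
import requests

BASE = "http://127.0.0.1:8080"

def save_slot(slot_id: int, filename: str) -> dict:
    # Dumps the slot's KV-cache to <slot-save-path>/<filename> on the server side.
    r = requests.post(f"{BASE}/slots/{slot_id}?action=save", json={"filename": filename})
    r.raise_for_status()
    return r.json()

def restore_slot(slot_id: int, filename: str) -> dict:
    # Loads a previously saved KV-cache back into the slot.
    r = requests.post(f"{BASE}/slots/{slot_id}?action=restore", json={"filename": filename})
    r.raise_for_status()
    return r.json()

if __name__ == "__main__":
    print(save_slot(0, "alice.bin"))
    print(restore_slot(0, "alice.bin"))
```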
Related to the KV-cache: I only recently realized that ik_llama.cpp does have the RPC to store/restore the KV-cache. The options I am using right now are: …

Sadly, I do not understand what they are or how they work; I tried to find documentation for them but failed. Still, I managed to get the basic functionality of saving and restoring the KV-cache working on a system with multiple GPUs. It works, and the KV-cache dump itself is fairly compact (unlike in ktransformers, for example, where you have to employ ZFS, etc.).

I think there should be just three options. The software should estimate how long it takes to store/restore the KV-cache and what the approximate prefill speed of the current setup is, and from that determine the minimal prompt length for which the KV-cache is worth swapping to the storage device so it can be automatically restored later (see the sketch below). The oldest/least-used dumps should be deleted automatically when the storage quota is exhausted. That is all anyone would ever need from such a system.

What is the state of affairs here? Is the RPC exposed so that we can trigger it ourselves and do the slot management?
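To make that break-even rule concrete, here is a rough sketch; this is not something ik_llama.cpp does today, and every constant, threshold, and function name is made up for illustration. The idea: only swap a slot to disk when re-prefilling the prompt would cost more than one save plus one later restore, and evict least-recently-used dumps once the quota is exceeded.

```python
# Rough sketch of the proposed policy; all parameters are hypothetical,
# not actual ik_llama.cpp options.
import os

PREFILL_TOK_PER_S = 300.0       # measured prefill speed of the current setup
SAVE_RESTORE_MB_PER_S = 800.0   # measured storage throughput for KV dumps
KV_BYTES_PER_TOKEN = 160_000    # depends on model and KV-cache quantization
QUOTA_BYTES = 50 * 2**30        # storage quota for dumps

def worth_swapping(prompt_tokens: int) -> bool:
    """Swap to disk only if re-prefilling the prompt costs more than save + restore."""
    reprefill_s = prompt_tokens / PREFILL_TOK_PER_S
    dump_bytes = prompt_tokens * KV_BYTES_PER_TOKEN
    io_s = 2 * dump_bytes / (SAVE_RESTORE_MB_PER_S * 2**20)  # one save + one restore
    return reprefill_s > io_s

def enforce_quota(dump_dir: str) -> None:
    """Delete least-recently-used dumps until the quota is respected."""
    files = [os.path.join(dump_dir, f) for f in os.listdir(dump_dir)]
    files.sort(key=os.path.getatime)  # least recently accessed first
    total = sum(os.path.getsize(f) for f in files)
    while files and total > QUOTA_BYTES:
        victim = files.pop(0)
        total -= os.path.getsize(victim)
        os.remove(victim)
```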
--prompt-cache doesn't seem to do anything, despite there being some implementation that reads and populates it.
--slot-save doesn't seem to do anything, and after reading the /slot/ handlers I'm not even sure how it's intended to be used.
With -np 1, two users clobber each other's KV-cache, even if they wouldn't mind waiting for a request to finish (rebuilding a deep-context slot can take tens of minutes).
Can the KV-cache be split slot-wise across CUDA devices?
Currently it seems like the only way to get two non-clobbering slots would be to avoid the GPU entirely and go CPU-only, where there's "unlimited" ctx space.
What's the state of affairs wrt. these params and what options are there?
Could an LRU cache of slots be implemented (something like the sketch below)?
CPU backends don't offer any meaningful concurrency, but swapping slots in and out would go a long way.
Have I missed something? 😅
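To illustrate what an LRU cache of slots could look like: a client-side wrapper could keep a single GPU slot (-np 1) and page per-user KV-caches in and out through the save/restore endpoints, so two users stop clobbering each other. This is only a sketch under the same assumption as above (that the upstream-style /slots/{id}?action=save|restore endpoints exist in this fork); all names and constants are placeholders.

```python
# Client-side sketch: page per-user KV-caches through a single slot via
# save/restore, keeping an LRU of on-disk dumps. Assumes the endpoints above exist.
from collections import OrderedDict
import requests

BASE = "http://127.0.0.1:8080"
SLOT = 0
MAX_DUMPS = 8  # how many user contexts to keep on disk

class SlotSwapper:
    def __init__(self) -> None:
        self.dumps = OrderedDict()   # user_id -> filename, most recently used last
        self.active_user = None      # whose KV-cache currently lives in the slot

    def _call(self, action: str, filename: str) -> None:
        r = requests.post(f"{BASE}/slots/{SLOT}?action={action}", json={"filename": filename})
        r.raise_for_status()

    def acquire(self, user_id: str) -> None:
        """Ensure user_id's context is resident in the slot before serving their request."""
        if self.active_user == user_id:
            return
        if self.active_user is not None:
            # Swap the previous user's context out to disk instead of losing it.
            fname = f"{self.active_user}.bin"
            self._call("save", fname)
            self.dumps[self.active_user] = fname
            self.dumps.move_to_end(self.active_user)
            while len(self.dumps) > MAX_DUMPS:
                self.dumps.popitem(last=False)  # drop the least-recently-used dump record
        if user_id in self.dumps:
            self._call("restore", self.dumps[user_id])
            self.dumps.move_to_end(user_id)
        self.active_user = user_id
```

With something like this, serving user B only costs one save and one restore of user A's slot rather than a full re-prefill of A's deep context when A returns.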