server : reuse context chunks #9866
Conversation
Does this work similarly to Koboldcpp's context shift?
Yes, it's the same idea as proposed in #5793. I've been experimenting today with context reuse for code completion and results seem promising.
Btw @ggerganov, I remember a while ago there was a discussion on storing token IDs in the KV cache. I'm wondering if it's complicated to add an API like
We should extend the API to support that. Maybe
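Purely as an illustration of the kind of accessor being discussed here (reading back the token IDs that correspond to KV cache cells), and not an existing or proposed llama.cpp function, such a read-back API might be declared along these lines:

```cpp
// Hypothetical declaration, for illustration only - not part of the llama.cpp API.
// Would return the token stored at position `pos` of sequence `seq_id` in the
// KV cache, or -1 if that cell is empty.
llama_token llama_kv_cache_seq_get_token(
        const struct llama_context * ctx,
        llama_seq_id                 seq_id,
        llama_pos                    pos);
```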
I have a small question regarding the illustration in the description:
AFAIU we only skip the
It's skipped mainly to simplify the batch construction: with the current implementation, we stop reusing chunks at the first token that cannot be reused. This way, when we create the

The alternative that you suggest is if we reused the

There is no longer the concept of

I'm very interested in trying this approach and seeing if it is viable, but the extra complexity at this point would be too much. Maybe in the future.
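To make the "stop at the first token that cannot be reused" rule concrete, here is a minimal sketch of how such chunk matching could be wired up against the public KV-cache calls (`llama_kv_cache_seq_add()`, which the description mentions, plus `llama_kv_cache_seq_rm()`). This is not the PR's actual code: the helper name `reuse_chunks`, the `cache_tokens` vector (a server-side copy of the cached prompt's token IDs) and the `min_chunk_size` parameter (the `--cache-reuse` value) are assumptions made for illustration, and the check that excludes chunks containing control/special tokens is omitted.

```cpp
// Sketch only - not the actual server implementation.
// Walk the cached tokens and the new prompt in parallel, reusing any matching
// run of at least min_chunk_size tokens by shifting its KV cells into place.
#include <vector>
#include "llama.h"

static size_t reuse_chunks(
        llama_context * ctx,
        llama_seq_id    seq_id,
        const std::vector<llama_token> & cache_tokens,   // token IDs currently in the cache
        const std::vector<llama_token> & prompt_tokens,  // token IDs of the new prompt
        size_t n_prefix,                                  // length of the already-matching prefix
        size_t min_chunk_size) {                          // the --cache-reuse value
    size_t head_c = n_prefix; // read position in the old cache
    size_t head_p = n_prefix; // write position in the new prompt

    while (head_c < cache_tokens.size() && head_p < prompt_tokens.size()) {
        // length of the identical run starting at (head_c, head_p)
        size_t n_match = 0;
        while (head_c + n_match < cache_tokens.size() &&
               head_p + n_match < prompt_tokens.size() &&
               cache_tokens[head_c + n_match] == prompt_tokens[head_p + n_match]) {
            n_match++;
        }

        if (n_match >= min_chunk_size) {
            // drop the stale cells between the current prompt position and the chunk,
            // then shift the chunk so it lines up with the new prompt position
            llama_kv_cache_seq_rm (ctx, seq_id, head_p, head_c);
            llama_kv_cache_seq_add(ctx, seq_id, head_c, head_c + n_match,
                                   (llama_pos) head_p - (llama_pos) head_c);
            head_c += n_match;
            head_p += n_match;
        } else {
            // this cached token does not start a big-enough chunk - skip it;
            // head_p stays put, so reuse stops at the first prompt token that
            // cannot be covered by a reusable chunk
            head_c += 1;
        }
    }

    // the caller is expected to clear cache cells past the returned position
    // before evaluating the remaining prompt tokens as usual
    return head_p; // number of prompt tokens whose KV data is now in place
}
```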
What are the downsides of sticking to

Thanks.
Not sure if there are downsides yet - needs testing. Intuitively, reusing very small chunks might not be a good idea since they can have different meanings.
Call me ignorant, but from my understanding of this feature we can cache parts of prompts. Meaning that in a prompt of 20k tokens, we can take part of this processed prompt and reuse it later outside of its original context. Which means that this piece of data can be used individually, outside of its neighbors/context. Then why is it that the farther we get into prompt processing, the slower it goes, if prompt parts can be processed individually? Right now I'm toying with Qwen 7B 1M and a context size of two million tokens. Obviously, around 100k tokens it starts getting slow as hell, but if I processed each document individually (not slow) and later shoved in the whole 2M chunk that would be cached, would it be faster than just processing it as it is right now?
ref #5793
Overview
Using a positive `--cache-reuse` argument with `llama-server` will attempt to reuse KV chunks with size equal to or larger than the specified value. The KV cache of reused chunks will be shifted (see `llama_kv_cache_seq_add()`) to the respective position and processing for these tokens will be skipped. Only chunks without control/special tokens will be reused.

Here is an illustration. Upon submitting `prompt 1` for processing, after `prompt 0` has been processed and cached:

- `--cache-reuse 0`: only the `aaaaa` prefix will be reused
- `--cache-reuse 1`: the entire `aaaaaccccccceeeeeeffhhhhhhh` will be reused
- `--cache-reuse 3`: only the `aaaaaccccccceeeeee` part will be reused

The cache reuse will be done only for requests with `"cache_prompt": true`.

Example