kv-cache : refactor + add llama_memory_state_i #13746
Conversation
Force-pushed from d23f887 to 8323e23.
Force-pushed from c1434b8 to 1eec34a.
This PR should not cause any performance changes, and the numerical results should be mostly the same (with some small exceptions due to the new logic in `find_slot()`). Would appreciate some testing and reports of regressions. Thanks.
I re-ran the ppl test from #13194 (comment) on master at aa50ba4 and on this PR. Some results changed very slightly, so I'm not sure if this is expected.
Yes, I think this difference is expected for SWA models (note that Phi currently has SWA disabled, so no difference there). It's caused by the different order in which we place the data in memory, due to the new `find_slot()` logic.
Yes, that's right, I added [...]
Edit: except for [...]
I re-ran the test and the ppl stays the same as in my last comment. Btw, just thinking: is it possible (and would it be useful) to add a ppl test mode that uses the KV remove API?
The `./bin/llama-perplexity -hf bartowski/gemma-2-9b-it-GGUF:Q4_K_M -f ./wikitext-2-raw/wiki.test.raw -c 16384 -fa --chunks 2 --swa-full` [...]
Maybe your reference value on [...]
Can you clarify?
I can't run the ppl right now, but if you get the correct result, then yes, I think it could be a problem on my side.
Currently, AFAIU the ppl test simply evaluates the text chunk by chunk, only going forward. For example, if I have 3 chunks 1-2-3, then they will be evaluated in the order 1-2-3. But what we also want to test is, for example: [...]
So I expect the ppl to be the same as just doing 1-2-3.
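If such a mode were added, the core of it could look roughly like the sketch below (an illustration only, not an existing feature: `eval_chunk` is a hypothetical callback standing in for the existing perplexity loop body, `p2_start`/`p2_end` are chunk 2's assumed token positions, and the removal call is `llama_kv_self_seq_rm()` in current llama.cpp, previously `llama_kv_cache_seq_rm()`):

```cpp
#include <functional>
#include "llama.h"

// hypothetical test-mode sketch: evaluate chunks 1 and 2, remove chunk 2 from the
// cache via the KV removal API, re-evaluate it, then continue with chunk 3
static void ppl_with_kv_remove(llama_context * ctx,
                               const std::function<void(int /*chunk idx*/)> & eval_chunk,
                               llama_pos p2_start, llama_pos p2_end) {
    eval_chunk(1);
    eval_chunk(2);

    // drop chunk 2 (sequence 0) from the cache ...
    llama_kv_self_seq_rm(ctx, 0, p2_start, p2_end);

    // ... re-evaluate it, then continue as usual; the resulting ppl should match
    // a plain forward-only 1-2-3 run
    eval_chunk(2);
    eval_chunk(3);
}
```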
Force-pushed from 3ef770f to 0b73da5.
How does this recover from a failed call to [...]?
There are some tricky scenarios in which we could have overwritten some of the data in the cache by the time the error occurs (i.e. we have processed the first few ubatches, but not all of them yet). Before (i.e. on `master`) [...]

I think that on compute error, the KV cache should be assumed to be in an undefined state and the application should take the necessary steps to recover (i.e. by clearing it and reprocessing the context that is currently needed).

Later on, this reprocessing will become seamless, when we start storing the necessary token/embedding information and add the logic for auto-reprocessing whatever is currently missing from the cache.
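As a rough illustration of the application-side recovery described above (a sketch only, not a prescribed pattern: `tokens_needed` is a hypothetical application-tracked token list, and the cache-clearing call is `llama_kv_self_clear()` in recent llama.cpp, `llama_kv_cache_clear()` in older versions):

```cpp
#include <vector>
#include "llama.h"

// sketch: on a failed decode, assume the KV cache is in an undefined state,
// clear it, and reprocess whatever context the application still needs
static bool decode_with_recovery(llama_context * ctx, llama_batch batch,
                                 std::vector<llama_token> & tokens_needed) {
    if (llama_decode(ctx, batch) == 0) {
        return true;
    }

    // wipe the cache (older versions call this llama_kv_cache_clear)
    llama_kv_self_clear(ctx);

    // reprocess the tokens the application still needs (hypothetical bookkeeping)
    llama_batch reprocess = llama_batch_get_one(tokens_needed.data(), (int32_t) tokens_needed.size());

    return llama_decode(ctx, reprocess) == 0;
}
```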
I am mostly concerned about the abort callback functionality. Errors in the backend are likely to be unrecoverable, but I am not sure if the abort functionality makes sense if it leaves the cache in a bad state.
I admit that I had completely forgotten about the abort callback. Let me see if we can do something about this.
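For context, the abort mechanism in question is the `llama_set_abort_callback()` hook; a minimal usage sketch (the flag handling here is illustrative and not part of llama.cpp):

```cpp
#include <atomic>
#include "llama.h"

// user-controlled flag, e.g. set from another thread or a signal handler
static std::atomic<bool> g_abort_requested{false};

// ggml polls this callback during graph computation; returning true aborts the
// current llama_decode(), which is why the resulting cache state matters here
static bool abort_cb(void * /*user_data*/) {
    return g_abort_requested.load();
}

// ...
// llama_set_abort_callback(ctx, abort_cb, nullptr);
```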
Force-pushed from 0b73da5 to 2252eef.
Drafting for now as I want to do some more testing and think about the abort mechanism.
Force-pushed from f23e4cc to 71619f2.
Yes, it seems it was closed automatically. Feel free to update it and submit a new PR. Btw, I think we should first fix the recurrent cache and also make the iSWA cache work with [...]
Great, will do. I also saw that #13834 got moved to draft, so I wasn't sure of the status on that one. With all of the decoupling, I think the recurrent cache should be largely independent of the others (the only change the recurrent cache needs is adding the layer filter), so I'm happy to take whatever order makes the most sense on your end.
KV caching and defragging have been completely overhauled since the last bump, so this patch no longer has a logical home. ggml-org/llama.cpp#13746

Branch: GraniteFour
Signed-off-by: Gabe Goodhart <[email protected]>
cont #13706 (comment), #13194
Main goal here is to simplify the abstract interface of `struct llama_kv_cache`.

Overview
Changes to the internal `struct llama_kv_cache` abstract interface:

- `llama_kv_cache::commit()`
- `llama_kv_cache::restore()`
- `llama_kv_cache::sbatch_init()`
- `llama_kv_cache::ubatch_next()`
- `llama_kv_cache::find_slot()`
- `llama_kv_cache_guard`

```cpp
--- llama-memory.h

// the interface for managing the memory state during batch processing
// this interface is implemented per memory type. see:
//   - llama_kv_cache_unified_state
//   - llama_kv_cache_unified_iswa_state
//   ...
//
// the only method that can mutate the memory and the memory state is llama_memory_i::apply()
//
// TODO: rename to llama_memory_context_i ?
class llama_memory_state_i {
public:
    virtual ~llama_memory_state_i() = default;

    // consume the current ubatch from the state and proceed to the next one
    // return false if we are done
    virtual bool next() = 0;

    // apply the memory state for the current ubatch to the memory object
    // return false on failure
    virtual bool apply() = 0;

    // TODO: this might get reworked in the future when refactoring llama_batch
    virtual std::vector<int64_t> & out_ids() = 0;

    // get the current ubatch
    virtual const llama_ubatch & get_ubatch() const = 0;

    // get the status of the memory state
    virtual llama_memory_status get_status() const = 0;
};

using llama_memory_state_ptr = std::unique_ptr<llama_memory_state_i>;
```

```cpp
--- llama-kv-cache.h

// split the input batch into a set of ubatches and verify that they can fit into the cache
// return a state object containing the ubatches and KV cache state required to process them
// check the llama_memory_state_i::get_status() for the result
virtual llama_memory_state_ptr init_batch(
        const llama_batch & batch,
        uint32_t n_ubatch,
        bool embd_pooled,
        bool logits_all) = 0;

// simulate full cache, used for allocating worst-case compute buffers
virtual llama_memory_state_ptr init_full() = 0;
```

This new interface changes the logic in
`llama_decode()` to first make sure that we can fit the input batch into the cache, and only after that do we start to process the ubatches. This check correctly takes into account SWA masking and also makes sure that the cache will not be modified before we start the actual computation.

note: the latter is not yet true for the recurrent cache - see comments in the code
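For illustration, here is roughly how a decode path could drive this interface (a sketch only, not the actual `llama_decode()` implementation; error handling, graph building and output handling are omitted, and the `LLAMA_MEMORY_STATUS_SUCCESS` value is an assumed member of the `llama_memory_status` enum):

```cpp
// sketch: consume a batch through the new memory-state interface
// (kv is any llama_kv_cache implementation; see init_batch()/init_full() above)
static bool decode_sketch(llama_kv_cache * kv, const llama_batch & batch,
                          uint32_t n_ubatch, bool embd_pooled, bool logits_all) {
    llama_memory_state_ptr mstate = kv->init_batch(batch, n_ubatch, embd_pooled, logits_all);

    if (!mstate || mstate->get_status() != LLAMA_MEMORY_STATUS_SUCCESS) {
        return false; // the batch does not fit - the cache has not been modified
    }

    do {
        const llama_ubatch & ubatch = mstate->get_ubatch();

        // write the cache metadata for this ubatch; only apply() mutates the memory
        if (!mstate->apply()) {
            return false;
        }

        // ... build and compute the graph for `ubatch` here ...
        (void) ubatch;
    } while (mstate->next()); // advance to the next ubatch; stops when all are consumed

    return true;
}
```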
Another important update in this PR is that the `find_slot()` logic for unified caches is now improved. Before, we looked for a slot (i.e. a set of contiguous cells) that is empty in order to place the ubatch in it. We now allow the slot to contain data from the same or another sequence which is masked (either by causality or by SWA):

llama.cpp/src/llama-kv-cache.cpp, lines 574 to 621 in 2252eef
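To make the two masking conditions concrete, here is a minimal sketch for a single query position attending to a cached key/value (standard causal plus sliding-window attention; the exact SWA boundary convention is model-dependent, so treat the `>=` as illustrative):

```cpp
#include <cstdint>

// returns true if the cached entry at position p can never be attended to by a
// query at position q, i.e. it is masked by causality or by the SWA window n_swa
static bool is_masked(int32_t q, int32_t p, int32_t n_swa) {
    if (p > q) {
        return true; // causal mask: a query never attends to future positions
    }
    if (n_swa > 0 && q - p >= n_swa) {
        return true; // SWA mask: the entry has slid out of the attention window
    }
    return false;
}
```

A cell holding only such masked data can be handed out by `find_slot()` even though it is not empty; the referenced lines implement the full condition.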
This change is needed for the next PR, which will optimize the SWA cache to use just `n_swa + n_ubatch` cells, and it also has some other nice properties. For example, we no longer have to explicitly prune tokens on successful batch processing, which simplifies the logic significantly and allows us to re-enable speculative decoding for SWA models (this will also be done in the next PR).

The worst-case graph reserve logic is also refactored and simplified significantly.
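For reference, with the new interface the worst-case reservation conceptually reduces to something like the following (a sketch under the same assumptions as the decode sketch above, not the actual `llama_context` reserve code):

```cpp
// sketch: pretend the cache is completely full and reserve compute buffers for the
// largest graph that could ever be built against it
llama_memory_state_ptr mstate_full = kv->init_full();

if (mstate_full && mstate_full->get_status() == LLAMA_MEMORY_STATUS_SUCCESS) {
    // ... build the worst-case graph using mstate_full and reserve the backend scheduler ...
}
```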
There are also some changes to `llama-batch`, but these are mainly to patch things up so that we are able to push the KV cache refactor first. So no need to review the `llama-batch` code in deep detail - it will be reworked soon.

TODO

Next PRs

- Prepare the batch automatically inside `llama_decode`, so that user code does not have to do it (llama : auto-batch preparation #13845)
- Use `n_swa + n_ubatch` cells for the SWA cache (llama : use n_swa + n_ubatch cells for SWA cache #13833)