kv-cache : refactor + add llama_memory_state_i #13746
Conversation
Force-pushed from d23f887 to 8323e23.
Force-pushed from c1434b8 to 1eec34a.
This PR should not cause any performance changes, and the numerical results should be mostly the same (with some small exceptions due to the new logic in `find_slot()`). Would appreciate some testing and reports of regressions. Thanks.
I re-ran the ppl test from #13194 (comment) on master at aa50ba4 and on this PR. Some results changed very slightly, so I'm not sure if this is expected.
Yes, I think this difference is expected for SWA models (note that Phi currently has SWA disabled, so no difference there). It's caused by the different order in which we place the data in memory, due to the new `find_slot()` logic.
Yes, that's right, I added [...]
Edit: except for [...]
I re-ran the test and the ppl stays the same as in my last comment. Btw, just thinking: is it possible (and would it be useful) to add a ppl test mode that uses the KV remove API?
The `./bin/llama-perplexity -hf bartowski/gemma-2-9b-it-GGUF:Q4_K_M -f ./wikitext-2-raw/wiki.test.raw -c 16384 -fa --chunks 2 --swa-full` [...]
Maybe your reference value on [...]
Can you clarify?
I can't run the ppl right now, but if you get the correct result, then yes, I think it could be a problem on my side.
Currently, AFAIU the ppl test simply evaluates the text chunk by chunk, only going forward. For example, if I have 3 chunks 1-2-3, then they will be evaluated in the order 1-2-3. But what we also want to test is, for example: [...]
So I expect the ppl to be the same as just doing 1-2-3.
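If such a mode were added, the core of it could look roughly like the sketch below (an illustration only, not an existing feature: `eval_chunk` is a hypothetical callback standing in for the existing perplexity loop body, `p2_start`/`p2_end` are chunk 2's assumed token positions, and the removal call is `llama_kv_self_seq_rm()` in current llama.cpp, previously `llama_kv_cache_seq_rm()`):

```cpp
#include <functional>
#include "llama.h"

// hypothetical test-mode sketch: evaluate chunks 1 and 2, remove chunk 2 from the
// cache via the KV removal API, re-evaluate it, then continue with chunk 3
static void ppl_with_kv_remove(llama_context * ctx,
                               const std::function<void(int /*chunk idx*/)> & eval_chunk,
                               llama_pos p2_start, llama_pos p2_end) {
    eval_chunk(1);
    eval_chunk(2);

    // drop chunk 2 (sequence 0) from the cache ...
    llama_kv_self_seq_rm(ctx, 0, p2_start, p2_end);

    // ... re-evaluate it, then continue as usual; the resulting ppl should match
    // a plain forward-only 1-2-3 run
    eval_chunk(2);
    eval_chunk(3);
}
```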
Force-pushed from 3ef770f to 0b73da5.
How does this recover from a failed call to [...]?
There are some tricky scenarios in which we could have overwritten some of the data in the cache by the time the error occurs (i.e. we have processed the first few ubatches, but not all of them yet). Before (i.e. on `master`) [...]

I think that on compute error, the KV cache should be assumed to be in an undefined state and the application should take the necessary steps to recover (i.e. by clearing it and reprocessing the context that is currently needed).

Later on, this reprocessing will become seamless, when we start storing the necessary token/embedding information and add the logic for auto-reprocessing whatever is currently missing from the cache.
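As a rough illustration of the application-side recovery described above (a sketch only, not a prescribed pattern: `tokens_needed` is a hypothetical application-tracked token list, and the cache-clearing call is `llama_kv_self_clear()` in recent llama.cpp, `llama_kv_cache_clear()` in older versions):

```cpp
#include <vector>
#include "llama.h"

// sketch: on a failed decode, assume the KV cache is in an undefined state,
// clear it, and reprocess whatever context the application still needs
static bool decode_with_recovery(llama_context * ctx, llama_batch batch,
                                 std::vector<llama_token> & tokens_needed) {
    if (llama_decode(ctx, batch) == 0) {
        return true;
    }

    // wipe the cache (older versions call this llama_kv_cache_clear)
    llama_kv_self_clear(ctx);

    // reprocess the tokens the application still needs (hypothetical bookkeeping)
    llama_batch reprocess = llama_batch_get_one(tokens_needed.data(), (int32_t) tokens_needed.size());

    return llama_decode(ctx, reprocess) == 0;
}
```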
I am mostly concerned about the abort callback functionality. Errors in the backend are likely to be unrecoverable, but I am not sure if the abort functionality makes sense if it leaves the cache in a bad state.
I admit that I had completely forgotten about the abort callback. Let me see if we can do something about this.
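For context, the abort mechanism in question is the `llama_set_abort_callback()` hook; a minimal usage sketch (the flag handling here is illustrative and not part of llama.cpp):

```cpp
#include <atomic>
#include "llama.h"

// user-controlled flag, e.g. set from another thread or a signal handler
static std::atomic<bool> g_abort_requested{false};

// ggml polls this callback during graph computation; returning true aborts the
// current llama_decode(), which is why the resulting cache state matters here
static bool abort_cb(void * /*user_data*/) {
    return g_abort_requested.load();
}

// ...
// llama_set_abort_callback(ctx, abort_cb, nullptr);
```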
Force-pushed from 0b73da5 to 2252eef.
Drafting for now as I want to do some more testing and think about the abort mechanism.
Force-pushed from f23e4cc to 71619f2.
Yes, it seems it was closed automatically. Feel free to update it and submit a new PR. Btw, I think we should first fix the recurrent cache and also make the iSWA cache work with [...]
Great, will do. I also saw that #13834 got moved to draft, so I wasn't sure of the status on that one. With all of the decoupling, I think the recurrent cache should be largely independent of the others (the only change the recurrent cache needs is adding the layer filter), so I'm happy to take whatever order makes the most sense on your end.
KV caching and defragging have been completely overhauled since the last bump, so this patch no longer has a logical home. ggml-org/llama.cpp#13746

Branch: GraniteFour
Signed-off-by: Gabe Goodhart <[email protected]>
cont #13706 (comment), #13194
Main goal here is to simplify the abstract interface of `struct llama_kv_cache`.

Overview
Changes to the internal `struct llama_kv_cache` abstract interface:

- `llama_kv_cache::commit()`
- `llama_kv_cache::restore()`
- `llama_kv_cache::sbatch_init()`
- `llama_kv_cache::ubatch_next()`
- `llama_kv_cache::find_slot()`
- `llama_kv_cache_guard`

```cpp
--- llama-memory.h

// the interface for managing the memory state during batch processing
// this interface is implemented per memory type. see:
//   - llama_kv_cache_unified_state
//   - llama_kv_cache_unified_iswa_state
//   ...
//
// the only method that can mutate the memory and the memory state is llama_memory_i::apply()
//
// TODO: rename to llama_memory_context_i ?
class llama_memory_state_i {
public:
    virtual ~llama_memory_state_i() = default;

    // consume the current ubatch from the state and proceed to the next one
    // return false if we are done
    virtual bool next() = 0;

    // apply the memory state for the current ubatch to the memory object
    // return false on failure
    virtual bool apply() = 0;

    // TODO: this might get reworked in the future when refactoring llama_batch
    virtual std::vector<int64_t> & out_ids() = 0;

    // get the current ubatch
    virtual const llama_ubatch & get_ubatch() const = 0;

    // get the status of the memory state
    virtual llama_memory_status get_status() const = 0;
};

using llama_memory_state_ptr = std::unique_ptr<llama_memory_state_i>;
```

```cpp
--- llama-kv-cache.h

// split the input batch into a set of ubatches and verify that they can fit into the cache
// return a state object containing the ubatches and KV cache state required to process them
// check the llama_memory_state_i::get_status() for the result
virtual llama_memory_state_ptr init_batch(
        const llama_batch & batch,
        uint32_t n_ubatch,
        bool embd_pooled,
        bool logits_all) = 0;

// simulate full cache, used for allocating worst-case compute buffers
virtual llama_memory_state_ptr init_full() = 0;
```

This new interface changes the logic in
`llama_decode()` to first make sure that we can fit the input batch into the cache, and only after that do we start to process the ubatches. This check correctly takes into account SWA masking and also makes sure that the cache will not be modified before we start the actual computation.

note: the latter is not yet true for the recurrent cache - see comments in the code
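For illustration, here is roughly how a decode path could drive this interface (a sketch only, not the actual `llama_decode()` implementation; error handling, graph building and output handling are omitted, and the `LLAMA_MEMORY_STATUS_SUCCESS` value is an assumed member of the `llama_memory_status` enum):

```cpp
// sketch: consume a batch through the new memory-state interface
// (kv is any llama_kv_cache implementation; see init_batch()/init_full() above)
static bool decode_sketch(llama_kv_cache * kv, const llama_batch & batch,
                          uint32_t n_ubatch, bool embd_pooled, bool logits_all) {
    llama_memory_state_ptr mstate = kv->init_batch(batch, n_ubatch, embd_pooled, logits_all);

    if (!mstate || mstate->get_status() != LLAMA_MEMORY_STATUS_SUCCESS) {
        return false; // the batch does not fit - the cache has not been modified
    }

    do {
        const llama_ubatch & ubatch = mstate->get_ubatch();

        // write the cache metadata for this ubatch; only apply() mutates the memory
        if (!mstate->apply()) {
            return false;
        }

        // ... build and compute the graph for `ubatch` here ...
        (void) ubatch;
    } while (mstate->next()); // advance to the next ubatch; stops when all are consumed

    return true;
}
```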
Another important update in this PR is that the `find_slot()` logic for unified caches is now improved. Before, we looked for a slot (i.e. a set of contiguous cells) that is empty in order to place the ubatch in it. We now allow the slot to contain data from the same or another sequence which is masked (either by causality or by SWA):

llama.cpp/src/llama-kv-cache.cpp, lines 574 to 621 in 2252eef
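To make the two masking conditions concrete, here is a minimal sketch for a single query position attending to a cached key/value (standard causal plus sliding-window attention; the exact SWA boundary convention is model-dependent, so treat the `>=` as illustrative):

```cpp
#include <cstdint>

// returns true if the cached entry at position p can never be attended to by a
// query at position q, i.e. it is masked by causality or by the SWA window n_swa
static bool is_masked(int32_t q, int32_t p, int32_t n_swa) {
    if (p > q) {
        return true; // causal mask: a query never attends to future positions
    }
    if (n_swa > 0 && q - p >= n_swa) {
        return true; // SWA mask: the entry has slid out of the attention window
    }
    return false;
}
```

A cell holding only such masked data can be handed out by `find_slot()` even though it is not empty; the referenced lines implement the full condition.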
This change is needed for the next PR, which will optimize the SWA cache to use just `n_swa + n_ubatch` cells, and it also has some other nice properties. For example, we no longer have to explicitly prune tokens on successful batch processing, which simplifies the logic significantly and allows us to re-enable speculative decoding for SWA models (this will also be done in the next PR).

The worst-case graph reserve logic is also refactored and simplified significantly.
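For reference, with the new interface the worst-case reservation conceptually reduces to something like the following (a sketch under the same assumptions as the decode sketch above, not the actual `llama_context` reserve code):

```cpp
// sketch: pretend the cache is completely full and reserve compute buffers for the
// largest graph that could ever be built against it
llama_memory_state_ptr mstate_full = kv->init_full();

if (mstate_full && mstate_full->get_status() == LLAMA_MEMORY_STATUS_SUCCESS) {
    // ... build the worst-case graph using mstate_full and reserve the backend scheduler ...
}
```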
There are also some changes to `llama-batch`, but these are mainly to patch things up so that we are able to push the KV cache refactor first. So no need to review the `llama-batch` code in deep detail - it will be reworked soon.

TODO

Next PRs

- Prepare the batch automatically inside `llama_decode`, so that user code does not have to do it (llama : auto-batch preparation #13845)
- Use `n_swa + n_ubatch` cells for the SWA cache (llama : use n_swa + n_ubatch cells for SWA cache #13833)