server : support unified cache across slots #16736

ggerganov · 2025-10-23T09:31:48Z

Current logic in this PR (subject to change):

When using the unified KV cache with -kvu, share the entire context (-c N) among all parallel slots of the server (-np N)
When we run out of space, try to free some by purging old sequences from idle slots, one by one, in no particular order
If we are still out of space, terminate all active slots at once
The -np N argument still controls the maximum number of parallel jobs, but it no longer changes the per-slot context
By default, start the server using 4 slots and unified KV cache
llama_context now caps n_ctx_seq to a maximum of hparams.n_ctx_train
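
The purge-then-terminate order described above can be sketched roughly as follows. This is a simplified model, not the server code: `slot_sketch`, `cells_used` and `make_room` are hypothetical names, and the real server tracks sequences in the KV cache rather than a plain counter per slot.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a server slot (not the actual server type).
struct slot_sketch {
    uint32_t n_cached = 0;     // KV cells held by this slot's sequence
    bool     active   = false; // true while the slot is processing a task
};

// Total number of cells in use in the shared (unified) cache.
static uint32_t cells_used(const std::vector<slot_sketch> & slots) {
    uint32_t sum = 0;
    for (const auto & s : slots) {
        sum += s.n_cached;
    }
    return sum;
}

// Try to make n_needed cells available in a cache of n_ctx cells.
// Step 1: purge idle slots one by one (here: in container order).
// Step 2: if still out of space, terminate all active slots at once.
// Returns true if n_needed cells are free afterwards.
static bool make_room(std::vector<slot_sketch> & slots, uint32_t n_ctx, uint32_t n_needed) {
    assert(n_needed <= n_ctx);
    assert(cells_used(slots) <= n_ctx);

    auto has_room = [&]() { return n_ctx - cells_used(slots) >= n_needed; };

    for (auto & s : slots) {
        if (has_room()) {
            return true;
        }
        if (!s.active) {
            s.n_cached = 0; // purge the old sequence of an idle slot
        }
    }

    if (!has_room()) {
        for (auto & s : slots) {
            if (s.active) {
                s.n_cached = 0;     // drop the sequence
                s.active   = false; // terminate the running task
            }
        }
    }

    return has_room();
}
```

With `-c 8192` and four slots holding 3000 (active), 2000, 2000 and 1000 cells, a request for 2000 cells is satisfied by purging only the first idle slot; a request for 6000 cells purges every idle slot and then has to terminate the active one.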

Example:

llama-server -m model.gguf -c 8192 --jinja

TODO:

When we run out of space, terminate the active slots one-by-one and keep trying
~~Consider moving the slot into the host-memory cache instead of purging it. Not sure that this is really needed, thanks to the existing logic from server : host-memory prompt caching #16391~~
Add tests

Future improvements:

When we run out of space, terminate slots one by one instead of all together
Update the logic for starting a new task to check that it has some extra room for generation (not sure this is needed: the current logic will simply purge one of the other slots, so it should be fine as it is)

slaren · 2025-10-23T13:46:10Z

src/llama-context.cpp


 uint32_t llama_context::n_ctx_per_seq() const {
-    return cparams.n_ctx / cparams.n_seq_max;
+    return cparams.kv_unified ? cparams.n_ctx : cparams.n_ctx / cparams.n_seq_max;


Should this value be capped when using unified cache to avoid exceeding the model context length? I think it could be set to min(n_ctx_train, n_ctx), or add a parameter to allow the user to change it.

I guess we can cap it to n_ctx_train. The only use case for n_ctx > n_ctx_train that comes to mind is self-extend, but lately this technique seems less relevant.

We can also cap it for the non-unified case?

Suggested change

-    return cparams.kv_unified ? cparams.n_ctx : cparams.n_ctx / cparams.n_seq_max;
+    return std::min(n_ctx_train, cparams.kv_unified ? cparams.n_ctx : cparams.n_ctx / cparams.n_seq_max);
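
To make the effect of the proposed cap concrete, here is a minimal standalone sketch. `cparams_sketch` and `n_ctx_per_seq_capped` are hypothetical names that only mirror the fields used in the suggestion; this is not the llama.cpp implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical subset of the context parameters discussed above.
struct cparams_sketch {
    uint32_t n_ctx;      // total context size requested with -c
    uint32_t n_seq_max;  // number of parallel sequences (-np)
    bool     kv_unified; // true when all sequences share one KV cache
};

// Per-sequence context, capped at the training context of the model.
static uint32_t n_ctx_per_seq_capped(const cparams_sketch & cp, uint32_t n_ctx_train) {
    assert(cp.n_seq_max > 0);

    // unified: every sequence may use the whole context
    // split:   the context is divided evenly among the sequences
    const uint32_t n = cp.kv_unified ? cp.n_ctx : cp.n_ctx / cp.n_seq_max;

    return std::min(n_ctx_train, n);
}
```

For `-c 8192 -np 4` and a model trained with a 4096-token context, the unified case is capped to 4096 while the split case stays at 2048, so the cap only matters once the per-sequence share exceeds the training context.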

> We can also cap it for the non-unified case?

What would happen to the leftover slots? I may be misunderstanding the way split cache works, but my assumption would be that these slots would never be used, and it would be wasted memory. So if that's capped, it should be done at context creation.
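
The concern about leftover cells can be put in numbers. The sketch below assumes, as the comment above does, that the split cache still allocates `n_ctx / n_seq_max` cells per stream while only `n_ctx_train` of them are ever usable. `wasted_cells_split` is a hypothetical helper written for this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Cells that are allocated but never usable when the per-sequence cap is
// applied only when the value is queried, and not at context creation.
static uint32_t wasted_cells_split(uint32_t n_ctx, uint32_t n_seq_max, uint32_t n_ctx_train) {
    assert(n_seq_max > 0);

    const uint32_t per_stream = n_ctx / n_seq_max;               // allocated per stream
    const uint32_t usable     = std::min(per_stream, n_ctx_train); // reachable per stream

    return (per_stream - usable) * n_seq_max;
}
```

For example, with `n_ctx = 32768`, `n_seq_max = 4` and `n_ctx_train = 4096`, each stream gets 8192 cells but can use only 4096, leaving 16384 cells (half of the cache) unused. Capping at context creation avoids allocating them in the first place.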

Right, we should do the capping at context creation in the llama_context constructor. Currently we have some additional logic for this in llama-model:

llama.cpp/src/llama-model.cpp

Lines 19708 to 19724 in 7863fcc

const auto padding = llama_kv_cache::get_padding(cparams);

uint32_t n_ctx_per_stream = cparams.n_ctx;

if (!cparams.kv_unified) {
    n_ctx_per_stream = (cparams.n_ctx + cparams.n_seq_max - 1)/cparams.n_seq_max;
    n_ctx_per_stream = GGML_PAD(n_ctx_per_stream, padding);

    cparams.n_ctx = n_ctx_per_stream*cparams.n_seq_max;
} else {
    n_ctx_per_stream = GGML_PAD(n_ctx_per_stream, padding);

    cparams.n_ctx = n_ctx_per_stream;
}

LLAMA_LOG_DEBUG("%s: n_ctx = %u (padded)\n", __func__, cparams.n_ctx);
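
A worked example may help show what the padding logic quoted above does to the requested sizes. The sketch is standalone: `pad_up` is a hypothetical helper that rounds up to a multiple of a power of two (which is my understanding of what `GGML_PAD` does), and the padding value of 256 used in the example is an assumption, not the value returned by `get_padding`.

```cpp
#include <cassert>
#include <cstdint>

// Round x up to the next multiple of n. Assumes n is a power of two.
static uint32_t pad_up(uint32_t x, uint32_t n) {
    assert(n != 0 && (n & (n - 1)) == 0);
    return (x + n - 1) & ~(n - 1);
}

struct ctx_sizes {
    uint32_t n_ctx_per_stream;
    uint32_t n_ctx;
};

// Mirrors the structure of the quoted snippet with plain arguments.
static ctx_sizes padded_sizes(uint32_t n_ctx, uint32_t n_seq_max, bool kv_unified, uint32_t padding) {
    uint32_t per_stream = n_ctx;

    if (!kv_unified) {
        per_stream = (n_ctx + n_seq_max - 1) / n_seq_max; // ceiling division
        per_stream = pad_up(per_stream, padding);

        n_ctx = per_stream * n_seq_max; // total grows to fit the padded streams
    } else {
        per_stream = pad_up(per_stream, padding);

        n_ctx = per_stream;
    }

    return { per_stream, n_ctx };
}
```

With `n_ctx = 1000`, `n_seq_max = 3` and a padding of 256, the split case yields 512 cells per stream and a total of 1536, while the unified case yields a single stream of 1024. In both cases the final context differs from the requested one.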

Since we no longer need the padding logic (as of #16148 and related), we should simplify this.

I'll push a separate PR for this and then will come back to polishing this one.

This is now rebased on top of the changes in #16812. The result is that we determine the KV cache size during context creation and there should be no leftover KV cells.

Note that since we now cap the context size to the training context size, user code should query llama_n_ctx and llama_n_ctx_seq after creating the llama_context to obtain the actual context size. I'll add comments in llama.h to reflect this.

Will try to clean up this PR next and will open it for review when ready.

ggerganov · 2025-10-29T14:14:57Z

src/llama-context.cpp

+    if (cparams.n_ctx_seq > hparams.n_ctx_train) {
+        LLAMA_LOG_WARN("%s: n_ctx_seq (%u) > n_ctx_train (%u) -- possible training context overflow\n",
+                __func__, cparams.n_ctx_seq, hparams.n_ctx_train);


This branch should not be reached due to the capping above on line 117. But keeping it in case the capping logic gets changed in the future.

ggerganov · 2025-10-30T18:41:43Z

Ready for review. I've marked some TODOs for follow-up PRs since I think the current implementation is quite basic and at the same time gets us 90% of the way to the ideal logic. Will improve the rest of the cases from master.

github-actions bot added examples server labels Oct 23, 2025

slaren reviewed Oct 23, 2025


github-actions bot added the python python script changes label Oct 23, 2025

ggerganov mentioned this pull request Oct 28, 2025

memory : remove KV cache size padding #16812

Merged

ggerganov force-pushed the gg/server-unified-slots branch 4 times, most recently from 55bb9db to 6369fe0 Compare October 28, 2025 10:50

github-actions bot added the testing Everything test related label Oct 28, 2025

ggerganov force-pushed the gg/server-unified-slots branch from 6369fe0 to ac261be Compare October 29, 2025 14:13

ggerganov commented Oct 29, 2025


ggerganov force-pushed the gg/server-unified-slots branch from ac261be to 0ba88d3 Compare October 30, 2025 16:52

ggerganov added 7 commits October 30, 2025 18:56

server : support unified context across slots

4dcf0a6

cont : fix speculative decoding initialization

2ec7cda

context : fix n_ctx_per_seq computation

d61018f

server : purge slots one by one

6089e08

tests : add unified cache server tests

f3d1607

llama : update per-seq context computation

2ca720c

test-thread-safety : handle tiny training context of the input model

4e9e319

ggerganov force-pushed the gg/server-unified-slots branch from 0ba88d3 to 4e9e319 Compare October 30, 2025 17:01

ggerganov added 4 commits October 30, 2025 20:15

server : fix server_tokens clear()

a5d27aa

server : use 4 slots + unified KV by default

2d69109

llama : add note about context size queries

9d34299

cont : update todos [no ci]

93373cc

ggerganov marked this pull request as ready for review October 30, 2025 18:39

ggerganov requested review from CISC and ngxson as code owners October 30, 2025 18:39

ggerganov requested a review from slaren October 30, 2025 18:41


