
Conversation


@ggerganov ggerganov commented Oct 23, 2025

ref #4130 (reply in thread)

Current logic in this PR (subject to change):

  • When using the unified KV cache (-kvu), share the entire context (-c N) among all parallel slots of the server (-np N)
  • When we run out of space, try to free some by purging old sequences from idle slots, one by one, in no particular order
  • If we still run out of space, terminate all active slots at once
  • The -np N argument still controls the max number of parallel jobs, but it no longer changes the per-slot context
  • By default, start the server with 4 slots and the unified KV cache
  • llama_context now caps n_ctx_seq to a maximum of hparams.n_ctx_train

Example:

llama-server -m model.gguf -c 8192 --jinja
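
For illustration, here is a rough sketch of the eviction flow described above (server_slot, has_space, purge_seq and terminate are placeholders for this sketch, not the actual server implementation):

#include <vector>

// Placeholder for the real server slot structure.
struct server_slot {
    int  seq_id = -1;    // sequence id owned by this slot, -1 if none
    bool active = false; // currently processing a task
};

// Sketch: free KV cache space on demand. First purge sequences of idle
// slots one by one, in no particular order; if that is not enough,
// terminate all active slots at once (a follow-up TODO is to terminate
// them one-by-one instead).
template <typename HasSpace, typename PurgeSeq, typename Terminate>
bool try_make_room(std::vector<server_slot> & slots, HasSpace has_space, PurgeSeq purge_seq, Terminate terminate) {
    for (auto & slot : slots) {
        if (has_space()) {
            return true;
        }
        if (!slot.active && slot.seq_id != -1) {
            purge_seq(slot.seq_id); // drop the cached sequence of an idle slot
            slot.seq_id = -1;
        }
    }

    if (has_space()) {
        return true;
    }

    for (auto & slot : slots) {
        if (slot.active) {
            terminate(slot); // give up on all active slots at once
        }
    }

    return false;
}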

TODO:

  • When we run out of space, terminate the active slots one-by-one and keep trying
  • Consider moving the slot into the host-memory cache instead of purging it. Not sure this is really needed, given the existing logic from server : host-memory prompt caching #16391
  • Add tests

Future improvements:

  • When we run out of space, terminate slots one by one instead of all at once
  • Update the logic for starting a new task to check that there is some extra room for generation (not sure this is needed; the current logic will simply purge one of the other slots, so it should be fine as is)


uint32_t llama_context::n_ctx_per_seq() const {
-    return cparams.n_ctx / cparams.n_seq_max;
+    return cparams.kv_unified ? cparams.n_ctx : cparams.n_ctx / cparams.n_seq_max;
Member

Should this value be capped when using the unified cache to avoid exceeding the model context length? I think it could be set to min(n_ctx_train, n_ctx), or a parameter could be added to allow the user to change it.

Member Author

I guess we can cap it to n_ctx_train. The only use case for n_ctx > n_ctx_train that comes to mind is self-extend, but lately this technique seems less relevant.

We can also cap it for the non-unified case?

Suggested change
-    return cparams.kv_unified ? cparams.n_ctx : cparams.n_ctx / cparams.n_seq_max;
+    return std::min(n_ctx_train, cparams.kv_unified ? cparams.n_ctx : cparams.n_ctx / cparams.n_seq_max);

Member

> We can also cap it for the non-unified case?

What would happen to the leftover slots? I may be misunderstanding the way split cache works, but my assumption would be that these slots would never be used, and it would be wasted memory. So if that's capped, it should be done at context creation.

Member Author

Right, we should do the capping at context creation in the llama_context constructor. Currently we have some additional logic for this in llama-model:

llama.cpp/src/llama-model.cpp

Lines 19708 to 19724 in 7863fcc

const auto padding = llama_kv_cache::get_padding(cparams);

uint32_t n_ctx_per_stream = cparams.n_ctx;

if (!cparams.kv_unified) {
    n_ctx_per_stream = (cparams.n_ctx + cparams.n_seq_max - 1)/cparams.n_seq_max;
    n_ctx_per_stream = GGML_PAD(n_ctx_per_stream, padding);

    cparams.n_ctx = n_ctx_per_stream*cparams.n_seq_max;
} else {
    n_ctx_per_stream = GGML_PAD(n_ctx_per_stream, padding);

    cparams.n_ctx = n_ctx_per_stream;
}

LLAMA_LOG_DEBUG("%s: n_ctx = %u (padded)\n", __func__, cparams.n_ctx);

Since we no longer need the padding logic (as of #16148 and related) we should simplify this.
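
For reference, a minimal sketch of the padding-free sizing this simplification is aiming for (the helper name compute_n_ctx_seq is made up for this sketch, not the actual code):

#include <algorithm>
#include <cstdint>

// Sketch: without the GGML_PAD rounding, the per-sequence context is simply
// the full context for the unified cache, or an even split (rounded up) for
// the non-unified cache, capped to the training context of the model.
static uint32_t compute_n_ctx_seq(uint32_t n_ctx, uint32_t n_seq_max, bool kv_unified, uint32_t n_ctx_train) {
    const uint32_t n_ctx_seq = kv_unified ? n_ctx : (n_ctx + n_seq_max - 1)/n_seq_max;

    return std::min(n_ctx_seq, n_ctx_train);
}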

I'll push a separate PR for this and then will come back to polishing this one.

Member Author

This is now rebased on top of the changes in #16812. The result is that we determine the KV cache size during context creation and there should be no leftover KV cells.

Note that since we now cap the context size to the training context size, user code is recommended to query llama_n_ctx and llama_n_ctx_seq after creating the llama_context in order to obtain the actual context sizes. I'll add comments in llama.h to reflect this.
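
For example, user code along these lines would read back the effective values after context creation (a sketch; llama_n_ctx_seq is the new accessor mentioned above, the rest is the usual llama.h API):

#include <cstdio>

#include "llama.h"

int main() {
    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = 1 << 20; // request a very large context; it may get capped to the training context

    llama_model   * model = llama_model_load_from_file("model.gguf", llama_model_default_params());
    llama_context * ctx   = llama_init_from_model(model, cparams);

    // the requested size may have been adjusted, so query the actual values
    const uint32_t n_ctx     = llama_n_ctx(ctx);     // total context across all sequences
    const uint32_t n_ctx_seq = llama_n_ctx_seq(ctx); // per-sequence context

    printf("n_ctx = %u, n_ctx_seq = %u\n", n_ctx, n_ctx_seq);

    llama_free(ctx);
    llama_model_free(model);

    return 0;
}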

Will try to clean up this PR next and will open it for review when ready.

@github-actions github-actions bot added the "python" (python script changes) label Oct 23, 2025
@ggerganov ggerganov force-pushed the gg/server-unified-slots branch 4 times, most recently from 55bb9db to 6369fe0 on October 28, 2025 10:50
@github-actions github-actions bot added the "testing" (Everything test related) label Oct 28, 2025
@ggerganov ggerganov force-pushed the gg/server-unified-slots branch from 6369fe0 to ac261be on October 29, 2025 14:13
Comment on lines +139 to +151
if (cparams.n_ctx_seq > hparams.n_ctx_train) {
    LLAMA_LOG_WARN("%s: n_ctx_seq (%u) > n_ctx_train (%u) -- possible training context overflow\n",
            __func__, cparams.n_ctx_seq, hparams.n_ctx_train);
Member Author

This branch should not be reached due to the capping above on line 117, but it is kept in case the capping logic changes in the future.

@ggerganov ggerganov force-pushed the gg/server-unified-slots branch from ac261be to 0ba88d3 on October 30, 2025 16:52
@ggerganov ggerganov force-pushed the gg/server-unified-slots branch from 0ba88d3 to 4e9e319 on October 30, 2025 17:01
@ggerganov ggerganov marked this pull request as ready for review October 30, 2025 18:39
@ggerganov ggerganov requested review from CISC and ngxson as code owners October 30, 2025 18:39
@ggerganov
Member Author

Ready for review. I've marked some TODOs for follow-up PRs since I think the current implementation is quite basic and at the same time gets us 90% of the way to the ideal logic. Will improve the rest of the cases from master.

@ggerganov ggerganov requested a review from slaren October 30, 2025 18:41