
Bug: llama.cpp server reports inaccurate n_ctx_per_seq? #10186

@horenbergerb

Description


What happened?

Running a model with an 8192-token context, like so:

/llama-server --model Mistral-Large-Instruct-2407-IQ3_XXS.gguf -c 8192 -ngl 35

causes the following to be printed during initialization:

llama_new_context_with_model: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized

This freaked me out because, based on this discussion, the message implies that I'm actually getting only 4096 tokens of context due to parallelization. On the other hand, I also see:

srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 8192
slot        reset: id  0 | task -1 |

which is what I would expect.
This discrepancy seems to be caused by the llama.cpp server temporarily incrementing n_parallel while loading the model (for a reason related to Mamba? I'm not sure why this is done).
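Assuming both reported values are simply the total context (-c 8192) divided evenly by a sequence count, the mismatch can be reproduced with a minimal Python sketch. The helper names below are hypothetical, not llama.cpp APIs, and the "+1 during load" step models the behavior described above rather than copying the actual server code.

```python
def n_ctx_per_seq(n_ctx: int, n_seq_max: int) -> int:
    # Assumption: context creation splits the total context evenly across sequences.
    return n_ctx // n_seq_max


def slot_ctx(n_ctx: int, n_parallel: int) -> int:
    # Assumption: the server divides the total context evenly among its slots.
    return n_ctx // n_parallel


N_CTX = 8192          # from -c 8192
N_CTX_TRAIN = 131072  # from the log message
n_parallel = 1        # matches "n_slots = 1" in the log

# Hypothetical model of the temporary increment while the model is loaded.
n_parallel_at_load = n_parallel + 1
per_seq = n_ctx_per_seq(N_CTX, n_parallel_at_load)
if per_seq < N_CTX_TRAIN:
    print(f"n_ctx_per_seq ({per_seq}) < n_ctx_train ({N_CTX_TRAIN}) "
          "-- the full capacity of the model will not be utilized")

# After loading, n_parallel is back to 1, so the single slot reports the full context.
print(f"new slot n_ctx_slot = {slot_ctx(N_CTX, n_parallel)}")
```

Under these assumptions, the warning is computed with two sequences (8192 / 2 = 4096) while the slot is sized with one (8192 / 1 = 8192), which would explain why the two messages disagree.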
My concerns are:

  • What context is actually being used here? 8192 or 4096?
  • Should this be considered a bug, since the messages essentially contradict each other?

Please let me know if any other information is needed, but this should be easy to replicate. Thanks!

Name and Version

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
version: 4033 (a9e8a9a0)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

No response

Metadata



Labels: bug (Something isn't working), low severity (Used to report low severity bugs in llama.cpp, e.g. cosmetic issues, non-critical UI glitches)
