
Conversation

ggerganov
Member

@ggerganov ggerganov commented Aug 26, 2025

Print stats for compute buffers and graph nodes for memory-less contexts (such as those used by embedding models).

Example:

llama-embedding -hf ggml-org/bge-small-en-v1.5-Q8_0-GGUF -p "test" -c 512 -b 512 -ub 512
0.00.379.367 I llama_context: constructing llama_context
0.00.379.370 I llama_context: n_seq_max     = 1
0.00.379.370 I llama_context: n_ctx         = 512
0.00.379.370 I llama_context: n_ctx_per_seq = 512
0.00.379.370 I llama_context: n_batch       = 512
0.00.379.370 I llama_context: n_ubatch      = 512
0.00.379.371 I llama_context: causal_attn   = 0
0.00.379.371 I llama_context: flash_attn    = 0
0.00.379.371 I llama_context: kv_unified    = true
0.00.379.371 I llama_context: freq_base     = 10000.0
0.00.379.372 I llama_context: freq_scale    = 1
0.00.379.372 I ggml_metal_init: allocating
0.00.379.401 I ggml_metal_init: found device: Apple M2 Ultra
0.00.379.404 I ggml_metal_init: picking default device: Apple M2 Ultra
0.00.379.997 I ggml_metal_load_library: using embedded metal library
0.00.384.503 I ggml_metal_init: GPU name:   Apple M2 Ultra
0.00.384.507 I ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)
0.00.384.508 I ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
0.00.384.508 I ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
0.00.384.508 I ggml_metal_init: simdgroup reduction   = true
0.00.384.509 I ggml_metal_init: simdgroup matrix mul. = true
0.00.384.509 I ggml_metal_init: has residency sets    = true
0.00.384.509 I ggml_metal_init: has bfloat            = true
0.00.384.509 I ggml_metal_init: use bfloat            = true
0.00.384.510 I ggml_metal_init: hasUnifiedMemory      = true
0.00.384.511 I ggml_metal_init: recommendedMaxWorkingSetSize  = 154618.82 MB
0.00.405.574 I llama_context:        CPU  output buffer size =     0.12 MiB
0.00.406.905 I llama_context:      Metal compute buffer size =    16.75 MiB
0.00.406.909 I llama_context:        CPU compute buffer size =     2.51 MiB
0.00.406.909 I llama_context: graph nodes  = 431
0.00.406.910 I llama_context: graph splits = 2
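
For reference, the guard this PR relaxes in llama-context.cpp is the one quoted later in the discussion: the compute-buffer reservation (and the stats above) previously ran only when the context had a memory/KV cache. Below is a minimal standalone sketch of the before/after logic; hparams_t, memory_t and reserve_compute_buffers() are hypothetical stand-ins, not the real llama.cpp types:

    #include <cstdio>
    #include <memory>

    // Hypothetical stand-ins for the real llama.cpp structures.
    struct hparams_t { bool vocab_only = false; };
    struct memory_t  { /* KV cache state, omitted */ };

    // Stands in for reserving the worst-case compute buffers and printing
    // the "compute buffer size" / "graph nodes" lines shown above.
    static void reserve_compute_buffers() {
        std::printf("llama_context: compute buffers reserved, stats printed\n");
    }

    int main() {
        hparams_t hparams;
        std::unique_ptr<memory_t> memory; // stays null for memory-less (embedding) contexts

        // before: if (!hparams.vocab_only && memory) { ... } -> skipped for embeddings
        // after:  the reservation and stats also run for memory-less contexts
        if (!hparams.vocab_only) {
            reserve_compute_buffers();
        }
        return 0;
    }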

@ggerganov ggerganov requested a review from danbev August 26, 2025 09:12
@ggerganov ggerganov merged commit 85cc1ae into master Aug 26, 2025
53 of 56 checks passed
@ggerganov ggerganov deleted the gg/context-print-stats-no-mem branch August 26, 2025 09:47
Minh141120 pushed a commit to menloresearch/llama.cpp that referenced this pull request Aug 27, 2025
@LostRuins
Collaborator

Hello, just wanted to point out an observation: after this PR, the GPU memory usage of the embedding model bge-m3-q8_0.gguf has increased by about 4 GB when used with an 8k context.

This does not seem to affect other embedding models, and bge worked fine previously.

The llama-context.cpp file has been modified many times since, but this PR is where the regression first started.

It's still working fine, and by reverting

    if (!hparams.vocab_only) {

back to

    if (!hparams.vocab_only && memory) {

I am able to generate my embeddings just fine without the extra memory overhead. So I am guessing there are some unnecessary allocations when it comes to BGE?

@ggerganov
Member Author

@LostRuins This issue will be addressed after we accomplish #16148

@ggerganov
Member Author

@LostRuins Could you give PR #16528 a try and see if you spot any issues? It should reduce memory usage significantly when running embedding models.

@LostRuins
Collaborator

Hi @ggerganov, I tested this new PR and I don't notice any improvement in memory usage when testing bge-m3-q8_0.gguf with it. It seems to be working, but resource usage is the same as before.

That is, with this change from the PR in place in llama-context.cpp:

-    if (!hparams.vocab_only && memory) {
+    if (!hparams.vocab_only) {

it still takes up 4.5 GB just by loading the model (without even decoding anything). With the change reverted, it only uses about 0.1 GB of VRAM.

Actual additional memory usage during processing is very low in both cases: generating a 2000-token embedding takes about 0.5 GB of extra VRAM, both before and after this PR.
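
A rough back-of-the-envelope check of where a number of this size can come from (the model shape below is an assumption, not something stated in the thread): without flash attention, the attention scores are materialized as an fp32 tensor of size n_head x n_tokens x n_tokens. If bge-m3 uses the XLM-RoBERTa-large shape (16 heads) and the full 8k context is reserved as a single worst-case ubatch at load time, that one tensor alone is about 4 GiB:

    #include <cstdio>

    int main() {
        // Assumed shape for bge-m3 (XLM-RoBERTa-large): 16 attention heads.
        const double n_head   = 16;
        // Assumed worst-case ubatch reserved at load time: the full 8192-token context.
        const double n_ubatch = 8192;
        const double fp32     = 4;  // bytes per element

        // KQ attention-score tensor materialized when flash attention is off
        const double kq_bytes = n_head * n_ubatch * n_ubatch * fp32;
        std::printf("KQ scores: %.2f GiB\n", kq_bytes / (1024.0 * 1024.0 * 1024.0));
        return 0;
    }

If this reading is right, it is also why enabling flash attention (as suggested later in the thread) avoids most of the load-time allocation: the score tensor is never materialized in full.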

@ggerganov
Member Author

ggerganov commented Oct 12, 2025

Thanks for testing. This will be addressed in #16531.

@ggerganov
Member Author

@LostRuins Can you provide a command with llama-embedding that reproduces the high memory usage?

@LostRuins
Collaborator

Hi @ggerganov, observing the GPU usage with llama-embedding requires a little tweaking, since the process terminates immediately after it's done.

So for this repro, I add a 10-second sleep right after the embedding model is loaded in embedding.cpp. In this test, I am using the Vulkan backend.

    // load the model
    common_init_result llama_init = common_init_from_params(params);

    // temporary: keep the process alive so GPU memory usage can be inspected
    // (requires <chrono> and <thread>)
    std::this_thread::sleep_for(std::chrono::milliseconds(10000));

Now simply use this CLI:

llama-embedding.exe -m bge-m3-q8_0.gguf -fa off -c 8192 -p "Hello world this is a test"

Result:
[screenshot of GPU memory usage]

Now with the old patch, using the same CLI flags:

-    if (!hparams.vocab_only) {
+    if (!hparams.vocab_only && memory) {

Result:
[screenshot of GPU memory usage]

@ggerganov
Member Author

@LostRuins Excess memory usage should be fixed on master. Make sure to enable FA though.

@LostRuins
Collaborator

Alright thanks
