
Conversation

ggerganov
Member

@ggerganov ggerganov commented Aug 26, 2025

Print stats for compute buffers and graph nodes for memory-less contexts (such as those used by embedding models).

Example:

llama-embedding -hf ggml-org/bge-small-en-v1.5-Q8_0-GGUF -p "test" -c 512 -b 512 -ub 512
0.00.379.367 I llama_context: constructing llama_context
0.00.379.370 I llama_context: n_seq_max     = 1
0.00.379.370 I llama_context: n_ctx         = 512
0.00.379.370 I llama_context: n_ctx_per_seq = 512
0.00.379.370 I llama_context: n_batch       = 512
0.00.379.370 I llama_context: n_ubatch      = 512
0.00.379.371 I llama_context: causal_attn   = 0
0.00.379.371 I llama_context: flash_attn    = 0
0.00.379.371 I llama_context: kv_unified    = true
0.00.379.371 I llama_context: freq_base     = 10000.0
0.00.379.372 I llama_context: freq_scale    = 1
0.00.379.372 I ggml_metal_init: allocating
0.00.379.401 I ggml_metal_init: found device: Apple M2 Ultra
0.00.379.404 I ggml_metal_init: picking default device: Apple M2 Ultra
0.00.379.997 I ggml_metal_load_library: using embedded metal library
0.00.384.503 I ggml_metal_init: GPU name:   Apple M2 Ultra
0.00.384.507 I ggml_metal_init: GPU family: MTLGPUFamilyApple8  (1008)
0.00.384.508 I ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
0.00.384.508 I ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
0.00.384.508 I ggml_metal_init: simdgroup reduction   = true
0.00.384.509 I ggml_metal_init: simdgroup matrix mul. = true
0.00.384.509 I ggml_metal_init: has residency sets    = true
0.00.384.509 I ggml_metal_init: has bfloat            = true
0.00.384.509 I ggml_metal_init: use bfloat            = true
0.00.384.510 I ggml_metal_init: hasUnifiedMemory      = true
0.00.384.511 I ggml_metal_init: recommendedMaxWorkingSetSize  = 154618.82 MB
0.00.405.574 I llama_context:        CPU  output buffer size =     0.12 MiB
0.00.406.905 I llama_context:      Metal compute buffer size =    16.75 MiB
0.00.406.909 I llama_context:        CPU compute buffer size =     2.51 MiB
0.00.406.909 I llama_context: graph nodes  = 431
0.00.406.910 I llama_context: graph splits = 2
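
For reference, the guard this PR relaxes in llama-context.cpp is the one quoted later in the discussion: the compute-buffer reservation (and the stats above) previously ran only when the context had a memory/KV cache. Below is a minimal standalone sketch of the before/after logic; hparams_t, memory_t and reserve_compute_buffers() are hypothetical stand-ins, not the real llama.cpp types:

    #include <cstdio>
    #include <memory>

    // Hypothetical stand-ins for the real llama.cpp structures.
    struct hparams_t { bool vocab_only = false; };
    struct memory_t  { /* KV cache state, omitted */ };

    // Stands in for reserving the worst-case compute buffers and printing
    // the "compute buffer size" / "graph nodes" lines shown above.
    static void reserve_compute_buffers() {
        std::printf("llama_context: compute buffers reserved, stats printed\n");
    }

    int main() {
        hparams_t hparams;
        std::unique_ptr<memory_t> memory; // stays null for memory-less (embedding) contexts

        // before: if (!hparams.vocab_only && memory) { ... } -> skipped for embeddings
        // after:  the reservation and stats also run for memory-less contexts
        if (!hparams.vocab_only) {
            reserve_compute_buffers();
        }
        return 0;
    }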

@ggerganov ggerganov requested a review from danbev August 26, 2025 09:12
@ggerganov ggerganov merged commit 85cc1ae into master Aug 26, 2025
53 of 56 checks passed
@ggerganov ggerganov deleted the gg/context-print-stats-no-mem branch August 26, 2025 09:47
Minh141120 pushed a commit to menloresearch/llama.cpp that referenced this pull request Aug 27, 2025
@LostRuins
Collaborator

Hello, just wanted to point out an observation: after this PR, the GPU memory usage of the embedding model bge-m3-q8_0.gguf has increased by about 4 GB when used with an 8k context.

This does not seem to affect other embedding models, and bge worked fine previously.

The llama-context.cpp file has been modified many times since, but this PR is where the regression first started.

It's still working fine, and by reverting

    if (!hparams.vocab_only) {

back to

    if (!hparams.vocab_only && memory) {

I am able to generate my embeddings just fine without the extra memory overhead. So I am guessing there are some unnecessary allocations when it comes to BGE?

@ggerganov
Member Author

@LostRuins This issue will be addressed after we accomplish #16148

@ggerganov
Member Author

@LostRuins Could you give PR #16528 a try and see if you spot any issues? It should reduce memory usage significantly when running embedding models.

@LostRuins
Collaborator

Hi @ggerganov, I tested this new PR and I don't notice any improvement in memory usage when testing bge-m3-q8_0.gguf with it. It seems to be working, but resource usage is the same as before.

That is, with this change from the PR in place in llama-context.cpp:

-    if (!hparams.vocab_only && memory) {
+    if (!hparams.vocab_only) {

it still takes up 4.5 GB just by loading the model (without even decoding anything). With the change reverted, it only uses about 0.1 GB of VRAM.

Actual additional memory usage during processing is very low in both cases: generating a 2000-token embedding takes about 0.5 GB of extra VRAM, both before and after this PR.
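
A rough back-of-the-envelope check of where a number of this size can come from (the model shape below is an assumption, not something stated in the thread): without flash attention, the attention scores are materialized as an fp32 tensor of size n_head x n_tokens x n_tokens. If bge-m3 uses the XLM-RoBERTa-large shape (16 heads) and the full 8k context is reserved as a single worst-case ubatch at load time, that one tensor alone is about 4 GiB:

    #include <cstdio>

    int main() {
        // Assumed shape for bge-m3 (XLM-RoBERTa-large): 16 attention heads.
        const double n_head   = 16;
        // Assumed worst-case ubatch reserved at load time: the full 8192-token context.
        const double n_ubatch = 8192;
        const double fp32     = 4;  // bytes per element

        // KQ attention-score tensor materialized when flash attention is off
        const double kq_bytes = n_head * n_ubatch * n_ubatch * fp32;
        std::printf("KQ scores: %.2f GiB\n", kq_bytes / (1024.0 * 1024.0 * 1024.0));
        return 0;
    }

If this reading is right, it is also why enabling flash attention (as suggested later in the thread) avoids most of the load-time allocation: the score tensor is never materialized in full.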

@ggerganov
Member Author

ggerganov commented Oct 12, 2025

Thanks for testing. This will be addressed in #16531.

@ggerganov
Member Author

@LostRuins Can you provide a command with llama-embedding that reproduces the high memory usage?

@LostRuins
Collaborator

Hi @ggerganov, observing the GPU usage with llama-embedding requires a little tweaking, since the process terminates immediately after it's done.

So for this repro, I add a 10-second sleep right after the embedding model is loaded in embedding.cpp. In this test, I am using the Vulkan backend.

    // load the model
    common_init_result llama_init = common_init_from_params(params);

    // temporary: keep the process alive so GPU memory usage can be inspected
    // (requires <chrono> and <thread>)
    std::this_thread::sleep_for(std::chrono::milliseconds(10000));

Now simply use this CLI:

llama-embedding.exe -m bge-m3-q8_0.gguf -fa off -c 8192 -p "Hello world this is a test"

Result:
[screenshot of GPU memory usage]

Now with the old patch, using the same CLI flags:

-    if (!hparams.vocab_only) {
+    if (!hparams.vocab_only && memory) {

Result:
[screenshot of GPU memory usage]

@ggerganov
Member Author

@LostRuins Excess memory usage should be fixed on master. Make sure to enable FA though.

@LostRuins
Collaborator

Alright thanks
