context : print graph stats for memory-less contexts #15586
Conversation
Hello, just wanted to point out an observation that after this PR, the GPU memory usage of my bge embedding model increased significantly. It does not seem to affect other embedding models, and bge worked fine previously. The model itself still works fine, and by reverting line 273 of `src/llama-context.cpp` (commit 7f76692) back to `if (!hparams.vocab_only && memory) {`, I seem to be able to generate my embeddings just fine without the extra memory overhead. So I am guessing there are some unnecessary allocations when it comes to BGE?
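For reference, a sketch of what that guard controls, assuming (per the PR title) that upstream dropped the `memory` check so that compute buffers are reserved and graph stats are printed even for memory-less contexts; this is not the exact upstream diff:

```cpp
// src/llama-context.cpp, around line 273 (commit 7f76692) -- sketch only.
// Upstream after this PR reserves worst-case compute buffers even when the
// context has no memory module (no KV cache), e.g. for embedding models:
//
//   if (!hparams.vocab_only) {
//       // ... reserve scheduler + compute buffers, print graph stats ...
//   }
//
// The revert described above restores the additional `memory` check, so the
// reservation is skipped entirely for memory-less contexts (at the cost of
// also skipping the new graph-stats output):
if (!hparams.vocab_only && memory) {
    // ... reserve scheduler + compute buffers, print graph stats ...
}
```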
@LostRuins This issue will be addressed after we complete #16148
@LostRuins Could you give PR #16528 a try and see if you spot any issues? It should reduce memory usage significantly when running embedding models.
Hi @ggerganov, I tested this new PR, but I don't notice any improvement in memory usage. That is, with the change in `src/llama-context.cpp` in place it still takes up 4.5 GB of VRAM just by loading the model (without even decoding anything), while with line 273 reverted as before it only utilizes 0.1 GB. Actual additional memory usage during processing is very low in both cases: generating a 2000-token embedding takes about 0.5 GB extra VRAM, both before and after this PR.
Thanks for testing. This will be addressed in #16531
@LostRuins Can you provide a command with which I can reproduce this?
Hi @ggerganov, for this repro I will add a 10 second sleep right after the embedding model is loaded, so the load-time VRAM usage can be observed; a sketch of that change follows.
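A minimal sketch, assuming the sleep goes into the embedding example right after the model and context are created; the exact insertion point is an assumption, the original comment's reference to it was not specific:

```cpp
// Hypothetical repro patch: pause right after model load so the load-time
// VRAM usage can be read off with an external monitor (e.g. nvidia-smi),
// before any decoding happens.
#include <chrono>
#include <thread>

// ... model and context creation succeeded ...
std::this_thread::sleep_for(std::chrono::seconds(10));
```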
Now simply use this CLI:
Now with the old patch, using the same CLI flags:
@LostRuins Excess memory usage should be fixed by #16531
Alright, thanks.
Print stats for compute buffers and graph nodes with memory-less contexts (such as for embedding models).
Example:

```
llama-embedding -hf ggml-org/bge-small-en-v1.5-Q8_0-GGUF -p "test" -c 512 -b 512 -ub 512
```
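With this change, loading such a model reports the compute buffers and graph statistics even though the context has no KV cache. The output is roughly of the following shape; buffer names and counts depend on the model and backend, and the values here are placeholders, not measured results:

```
llama_context:        CPU compute buffer size = ... MiB
llama_context: graph nodes  = ...
llama_context: graph splits = ...
```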