          llama-bench : Add --override-tensors arg
          #12922
        
| Sketchy performance comparison on my laptop to show why  My hardware is an ASUS TUF A14 gaming laptop, so a Ryzen 9 AI HX 370 with 7500MHz LPDDR5 and an RTX 4060 Mobile. I run it for these tests in the ASUS-standard "Turbo" mode. First, a CPU-only test on my hardware (used 0.3 GB of VRAM during prompt processing) .\build\bin\Release\llama-bench.exe -m ..\models\OLMoE-1B-7B-0924-Instruct-Q8_0.gguf -t 8 -ngl 0 -p 4096 -n 4096
 Next, running with  .\build\bin\Release\llama-bench.exe -m ..\models\OLMoE-1B-7B-0924-Instruct-Q8_0.gguf -t 8 -ngl 4 -p 4096 -n 4096
 Next, enabling the  .\build\bin\Release\llama-bench.exe -m ..\models\OLMoE-1B-7B-0924-Instruct-Q8_0.gguf -t 8 -ngl 99 -ot "\d+\.ffn_.*exp.=CPU" -p 4096 -n 4096
 Effects are significantly more pronounced in larger MoE models, especially with more experts and some experts that are re-used for every pass (e.g. Llama 4 Scout and Maverick, although those models are beyond my devices' capabilities.) I tried to demonstrate with Deepseek-V2-Lite, but ran into CUDA errors if I tried to apply flash attention, cache quantization, or override-tensors. I don't have the experience with llama.cpp's codebase to track those down, but another Beaver has suggested it may be related to #12798 | 
      | PR #12891 has resolved my issue running flash attention and override-tensors with Deepseek-V2-Lite. Some performance numbers for that, same hardware as my last set: CPU Only (Used 0.8GB of VRAM during prompt processing) .\build\bin\Release\llama-bench.exe -m ..\models\DeepSeek-Coder-V2-Lite-Base-Q6_K_L.gguf ^
    -p 4096 -n 4096 -t 8 ^
    -fa 1 -ctk q8_0 -ctv q8_0 -ngl 0
 Completely Filled GPU (Used 8.0GB of VRAM during prompt processing) .\build\bin\Release\llama-bench.exe -m ..\models\DeepSeek-Coder-V2-Lite-Base-Q6_K_L.gguf ^
    -p 4096 -n 4096 -t 8 ^
    -fa 1 -ctk q8_0 -ctv q8_0 -ngl 14
 Comparable VRAM GPU (Used 2.8GB of VRAM during prompt processing) .\build\bin\Release\llama-bench.exe -m ..\models\DeepSeek-Coder-V2-Lite-Base-Q6_K_L.gguf ^
    -p 4096 -n 4096 -t 8 ^
    -fa 1 -ctk q8_0 -ctv q8_0 -ngl 4
 Override-Tensors Run (Used 1.8GB of VRAM during prompt processing) .\build\bin\Release\llama-bench.exe -m ..\models\DeepSeek-Coder-V2-Lite-Base-Q6_K_L.gguf ^
    -p 4096 -n 4096 -t 8 ^
    -fa 1 -ctk q8_0 -ctv q8_0 -ngl 99 -ot "\d+\.ffn_.*exp.=CPU" 
 Tuned Override-Tensors (Used 6.3GB of VRAM during prompt processing) This run, I'm leaving 6 of the 26 layers' conditional experts on the GPU as well as all the  .\build\bin\Release\llama-bench.exe -m ..\models\DeepSeek-Coder-V2-Lite-Base-Q6_K_L.gguf ^
    -p 4096 -n 4096 -t 8 ^
    -fa 1 -ctk q8_0 -ctv q8_0 -ngl 99 -ot "[12]\d\.ffn_.*exps.=CPU" 
 Turns out my GPU was far more underpowered than I expected, but y'all can see the point of being able to benchmark this kind of thing. | 
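To see which blocks a given -ot pattern would push to the CPU, the regex can be tried against a list of tensor names. The following is only a rough sketch: the tensor names follow the "blk.<i>.<tensor>.weight" layout as an assumption, the 24-block model is made up, and it assumes the pattern is applied as an unanchored search (as Python's re.search does) rather than a full match.

```python
import re


def blocks_matched(pattern: str, n_blocks: int) -> list[int]:
    """Return the block indices whose routed-expert tensors all match `pattern`."""
    rx = re.compile(pattern)
    # Hypothetical tensor kinds for the routed (conditional) experts of one block.
    kinds = ("ffn_gate_exps", "ffn_up_exps", "ffn_down_exps")
    matched = []
    for i in range(n_blocks):
        names = [f"blk.{i}.{kind}.weight" for kind in kinds]
        # Assumption: unanchored search, so the pattern may match anywhere in the name.
        if all(rx.search(name) for name in names):
            matched.append(i)
    return matched


N_BLOCKS = 24  # made-up block count, not taken from any real model

everything = blocks_matched(r"\d+\.ffn_.*exp.", N_BLOCKS)
tuned = blocks_matched(r"[12]\d\.ffn_.*exps.", N_BLOCKS)

print("all-experts pattern  ->", everything)
print("tuned pattern        ->", tuned)
print("left on GPU by tuned ->", sorted(set(everything) - set(tuned)))
```

Under those assumptions, the tuned pattern only catches two-digit block indices starting with 1 or 2, so the single-digit blocks keep their experts on the GPU.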
| Ran another set of experiments on another device (RTX 3070 and an AMD Ryzen 7 5800X 8-Core with two sticks of 2133MHz DDR4) CPU Only (Used 836MB of VRAM during prompt processing) ./build/bin/llama-bench -m ../models/DeepSeek-Coder-V2-Lite-Base-Q6_K_L.gguf \
    -p 4096 -n 4096 -t 4 \
    -fa 1 -ctk q8_0 -ctv q8_0 -ngl 0
 Full GPU (Used 7626MB of VRAM during prompt processing) ./build/bin/llama-bench -m ../models/DeepSeek-Coder-V2-Lite-Base-Q6_K_L.gguf \
    -p 4096 -n 4096 -t 4 \
    -fa 1 -ctk q8_0 -ctv q8_0 -ngl 13
 Comparable VRAM GPU (Used 2930MB of VRAM during prompt processing) ./build/bin/llama-bench -m ../models/DeepSeek-Coder-V2-Lite-Base-Q6_K_L.gguf \
    -p 4096 -n 4096 -t 4 \
    -fa 1 -ctk q8_0 -ctv q8_0 -ngl 4
 Override-Tensors Full CPU Experts (except shared) (Used 2276MB of VRAM during prompt processing) ./build/bin/llama-bench -m ../models/DeepSeek-Coder-V2-Lite-Base-Q6_K_L.gguf \
    -p 4096 -n 4096 -t 4 \
    -fa 1 -ctk q8_0 -ctv q8_0 -ngl 99 -ot "\d+.ffn_.*exps.=CPU"
 Override-Tensors Tuned (Used 7034MB of VRAM during prompt processing) ./build/bin/llama-bench -m ../models/DeepSeek-Coder-V2-Lite-Base-Q6_K_L.gguf \
    -p 4096 -n 4096 -t 4 \
    -fa 1 -ctk q8_0 -ctv q8_0 -ngl 99 -ot "[2.]\d.ffn_.*exps.=CPU"
 Now, as this processor has neither AVX512 nor relatively high-bandwidth memory, we see the GPU eking out a performance boost and override-tensors helping significantly. | 
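The patterns in these runs differ in two small ways that are easy to miss: "exp." versus "exps.", and an unescaped dot inside "[2.]". A quick check against a few tensor names makes the difference visible. The names below are assumptions for illustration (a routed expert as "exps", a shared expert as "shexp"), and the sketch again assumes unanchored regex search.

```python
import re

# Hypothetical tensor names: routed experts ("exps"), a shared expert ("shexp"),
# and one attention tensor that no expert pattern should touch.
NAMES = [
    "blk.5.ffn_gate_exps.weight",
    "blk.5.ffn_gate_shexp.weight",
    "blk.15.ffn_gate_exps.weight",
    "blk.25.ffn_gate_exps.weight",
    "blk.25.attn_output.weight",
]


def matches(pattern: str) -> list[str]:
    """Return the names that the pattern would select (unanchored search)."""
    return [name for name in NAMES if re.search(pattern, name)]


with_shared = matches(r"\d+\.ffn_.*exp.")        # "exp." also hits "shexp."
routed_only = matches(r"\d+.ffn_.*exps.")        # "exps." skips the shared expert
tuned = matches(r"[2.]\d.ffn_.*exps.")           # "[2.]" accepts a literal "." or "2"

for label, selected in [("exp.", with_shared), ("exps.", routed_only), ("[2.]", tuned)]:
    print(label, "->", selected)
```

With these names, "[2.]\d." picks up block 5 (through the ".5." sequence) and block 25, but not block 15, so in this sketch the blocks numbered in the tens are the ones left on the GPU.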
| You can also use this to offload the entire KV cache to GPU while keeping the model on CPU:  | 
Thanks, the implementation looks good. However, I don't think this should be a global parameter. Even if only for consistency, this should be part of the test grid so that different values can be tested at the same time. The values should be saved and shown in the test results, like any other parameter.
| Got it @slaren. As for splitting the test grid entries, would you prefer that I use semicolons instead of commas the same way that we do for tensor split? Or should I reverse their behavior? Or should I require separate instances of the  EDIT: Going to update the PR with the same behaviour as tensor split for now, just so that I can get started. | 
| Either way is fine as long as it is consistent, I don't mind if the way  | 
| I've implemented the behaviour the same way as tensor-split, for now. That is,  .\build\bin\Release\llama-bench.exe -m ..\models\OLMoE-1B-7B-0924-Instruct-Q8_0.gguf ^
    -t 6 -ts 1;0 -pg 2048,128 ^
    -ngl 99 -ot "\d+\.ffn_.*exp.=CPU,1\d\.ffn_.*exp.=CPU,1\d\.ffn_.*exps.=CPU"
 .\build\bin\Release\llama-bench.exe -m ..\models\OLMoE-1B-7B-0924-Instruct-Q8_0.gguf ^
    -t 6 -ts 1;0 -pg 2048,128 ^
    -ngl 99 -ot "\d+\.ffn_.*exp.=CPU" -ot "1\d\.ffn_.*exp.=CPU" -ot "1\d\.ffn_.*exps.=CPU"
 | 
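The two invocations above differ only in how the -ot values are supplied. A minimal sketch of how such arguments could expand into test grid entries follows. It is not the llama-bench implementation: it assumes commas separate grid entries, semicolons separate several overrides inside one entry (mirroring the tensor-split convention discussed here), and that each override splits at its first "=". The function name is hypothetical.

```python
def parse_override_args(args: list[str]) -> list[list[tuple[str, str]]]:
    """Expand -ot values into grid entries, each a list of (pattern, buffer type)."""
    grid = []
    for arg in args:                        # one string per -ot occurrence
        for entry in arg.split(","):        # assumption: commas separate grid entries
            overrides = []
            for item in entry.split(";"):   # assumption: semicolons join overrides
                if not item:
                    continue                # an empty span contributes no override
                pattern, sep, buft = item.partition("=")
                if not sep or not buft:
                    raise ValueError(f"expected <pattern>=<buffer type>, got {item!r}")
                overrides.append((pattern, buft))
            grid.append(overrides)
    return grid


joined = parse_override_args(
    [r"\d+\.ffn_.*exp.=CPU,1\d\.ffn_.*exp.=CPU,1\d\.ffn_.*exps.=CPU"]
)
repeated = parse_override_args(
    [r"\d+\.ffn_.*exp.=CPU", r"1\d\.ffn_.*exp.=CPU", r"1\d\.ffn_.*exps.=CPU"]
)
print(joined == repeated)
for entry in joined:
    print(entry)
```

Under these assumptions both spellings produce the same three grid entries, each carrying a single override.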
| I understand now why all of the other functions in that file were marked  | 
| All I can say is the CPU CI ran to completion on my Ubuntu 22.04 machine with no errors I was aware of. I'll try to take a look at this again tomorrow or Friday. What's the minimum NVCC version for the CUDA CI? I have CUDA toolkit 12.4.131, which the CMake configuration finds both in and out of CI, and builds happily when I build llama.cpp for my own purposes, but while "Compiling the CUDA compiler identification source file "CMakeCUDACompilerId.cu" failed" I get  | 
| Tried the Vulkan CI (because I can't run the CUDA CI on my desktop with my nvcc, apparently) and that failed on an unused parameter in a file my change didn't even touch, both before and after merging the latest master: Let me know if you have any next steps for me. I'd love to be able to test this properly locally, rather than hitting the GitHub CI. I don't see any errors from the CPU CI when run locally, and I'm unable to run the CUDA and Vulkan CI, so I'm not sure what other actions I can take to make this small PR compatible (beyond bringing it up to the latest master, as I've just done.) | 
| The minimum CUDA version is 11.7, you should be good with 12.4. I am not sure what happened there, it seems that it failed to pick which architecture to build for. You could try manually specifying the architecture by setting  | 
| Adding  Digging further into the online CI failures, I'm noticing they're only occurring to  | 
| I was able to run the CUDA CI on my x86_64 laptop's 4060 using CUDA 12.8 installed in WSL2 Ubuntu 24.04. No errors. Are you able to re-run the failed GitHub checks to give the package list retrieval from Azure a chance of success? I can't determine any connection it would have to this PR. | 
| Don't worry about the failed linux-cross CI, it's not related to this PR. I will review this again when I have a chance. | 
| This would be a really useful addition for benchmarking  | 
There is an issue where running llama-bench without -ot causes it to not run any tests at all: because tensor_buft_overrides is empty, it breaks the enumeration when constructing the list of tests to run.
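The failure mode described here is the usual one for a test grid built as a cross product: a single empty dimension empties the whole product. A small sketch of the effect, with hypothetical parameter names rather than the real llama-bench structures:

```python
from itertools import product


def build_tests(n_gpu_layers, threads, overrides):
    """Enumerate every combination of the given parameter lists."""
    return [
        {"ngl": ngl, "t": t, "ot": ot}
        for ngl, t, ot in product(n_gpu_layers, threads, overrides)
    ]


# No -ot given and no default: the overrides dimension is empty, so nothing runs.
broken = build_tests([99], [4, 8], [])

# Defaulting to a single "no overrides" entry keeps the rest of the grid alive.
fixed = build_tests([99], [4, 8], [[]])

print(len(broken), len(fixed))  # prints "0 2"
```

Giving the dimension one default entry that means "no overrides" is one way to avoid this, which is the kind of corner case the later commit message refers to.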
| @4onen Hi, I’m very interested in the --override-tensors parameter as well. Could you please share any additional public documentation or resources I could review? 
 | 
* Add --override-tensors option to llama-bench
* Correct llama-bench --override-tensors to --override-tensor
* llama-bench: Update --override-tensors parsing to match --tensor-split, appear in test matrix.
* Make new llama-bench util functions static to fix Ubuntu CI
* llama-bench: Correct -ot corner cases (No -ot calls, leading and trailing empty -ot spans, etc.)
| 
 Hey @jklincn, I'm afraid I'm a little too hammered with my school & research work to make progress on that. I asked around and the best I'm currently aware of is #11397 (the original PR for the feature.) Not terribly detailed but makes clear how to use regex to select layers. On HuggingFace you can get a listing of all the layers in the model to help you build the regex you need. Good luck! | 
A small group over at BeaverAI have been making extensive use of the --override-tensors (-ot) flag for running massive MoE models faster by keeping attention on the GPU and offloading the expert FFNs to the CPU. Informal experimentation in llama-server or llama-cli doesn't compare to the proper llama-bench, though, so this PR adds the --override-tensors arg (and the -ot short form) to llama-bench.

I noticed the // FIXME about leaking memory in args.cpp when copying the --override-tensors argument parsing, and chose to stamp null terminators into the argv, rather than accept the memory leak, as llama-bench calls parse_cmd_params only once. Let me know if you'd like that swapped out for the memory-leaking version from the common arg parser, as it's only a handful of user-entered bytes leaked.

Also planning to do some documentation of --override-tensors a little later on, as it's proving very useful and we'd love to spread the word.