-
Having said all that, token generation speed in the case of CPU-only or hybrid GPU/CPU inference is limited by CPU memory bandwidth, so performance gains compared to mainline llama.cpp are limited there. After you get going with Unsloth's quantized models, you may also want to look into some of the quantized models with …
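As a rough sketch of that bandwidth ceiling (the active-parameter count, bits per weight, and bandwidth figure below are illustrative assumptions, not measurements of your machine):

```sh
# Back-of-envelope: ~3B active params per token (Qwen3-30B-A3B) at an assumed
# ~2.8 bits/weight for a Q2_K-class quant => ~1.05 GB of weights read per token.
# With an assumed ~50 GB/s of CPU memory bandwidth, the generation ceiling is:
awk 'BEGIN { printf "~%.0f tokens/s upper bound\n", 50 / 1.05 }'
```

Real runs land below that ceiling because of compute overhead and, in hybrid setups, PCIe traffic, which is why two bandwidth-bound builds tend to clock in at the same speed.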
-
First of all, thank you very much for your contributions to quantization, which help GPU-poor people like us enjoy LLMs :-)). I recently compiled llama.cpp with these commands:
```sh
cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES="89" \
  -DGGML_CUDA_F16=ON \
  -DGGML_CUDA_FA_ALL_QUANTS=ON \
  -DGGML_BLAS=ON \
  -DGGML_BLAS_VENDOR=OpenBLAS \
  -DLLAMA_LLGUIDANCE=ON
cmake --build build --config Release -j
```
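For an apples-to-apples comparison between the two builds I also run the bundled llama-bench tool; a minimal invocation with my model (note it does not apply the -ot overrides I use below):

```sh
# Time prompt processing (-p) and token generation (-n) at a fixed GPU offload
./build/bin/llama-bench \
  -m ~/models/Qwen3-30B-A3B-128K-UD-Q2_K_XL.gguf \
  -ngl 48 -p 512 -n 128
```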
I have an RTX 4060 with 8 GB of VRAM, so I asked Gemini 2.5 Pro to guide me: I fed it all the docs as context using gitingest and asked it to generate the best build command, which is what I pasted above. Do let me know if I should change anything, because I used the same commands to build the fork version (this project).
I get the same speed with both mainline llama.cpp and this fork. I used the following command to run the model:
```sh
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./build/bin/llama-server --device CUDA0 \
  -m ~/models/Qwen3-30B-A3B-128K-UD-Q2_K_XL.gguf \
  -c 32000 \
  -ngl 48 \
  -t 4 \
  -ot '.*\.ffn_down_exps\.weight=CPU' \
  -ot '.*\.ffn_up_exps\.weight=CPU' \
  -ub 256 -b 512 \
  --host 0.0.0.0 \
  --port 8009 \
  --flash-attn \
  --cache-type-k q8_0 \
  --cache-type-v q8_0
```
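To measure speed I query the running server directly; the native /completion endpoint reports a timings object in its JSON reply, so a quick check looks like this (prompt and token count are arbitrary):

```sh
# Ask for a short completion and extract the generation speed that
# llama-server reports in the "timings" section of its /completion response
curl -s http://localhost:8009/completion \
  -d '{"prompt": "Hello, my name is", "n_predict": 128}' \
  | grep -o '"predicted_per_second":[0-9.]*'
```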
I am getting 20-23 tokens/s, so I wanted to know whether recompiling could improve this further, or whether you can help me tune the command itself. I am asking for more headroom because I want to move up to the IQ3_XXS quant, which people report works great, and that will be my end limit.
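As a side note, I believe the two -ot expressions could be merged into a single regex; here is a sketch of the same run under that assumption (I have not verified that --override-tensor accepts alternation in this fork):

```sh
# Identical run, with the two expert-tensor overrides merged into one regex;
# assumes --override-tensor matches tensor names with standard regex alternation
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./build/bin/llama-server --device CUDA0 \
  -m ~/models/Qwen3-30B-A3B-128K-UD-Q2_K_XL.gguf \
  -c 32000 -ngl 48 -t 4 \
  -ot '.*\.ffn_(down|up)_exps\.weight=CPU' \
  -ub 256 -b 512 \
  --host 0.0.0.0 --port 8009 \
  --flash-attn --cache-type-k q8_0 --cache-type-v q8_0
```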