Perhaps slightly faster trellis quants #541
My usual library spot was closed today so sitting outside in the sun trying to grab some quick llama-sweep-bench numbers:
So on the AMD Thread Ripper Pro I'm seeing TG improve from 8.61 up to 10.58 tok/sec, a 1.229x speedup! Great considering this is also using CUDA offload.
llama-sweep-bench command
./build/bin/llama-sweep-bench \
--model "$model" \
--no-mmap \
--ctx-size 8704 \
-ctk f16 \
-mla 3 -fa \
-fmoe \
-amb 512 \
-ngl 99 \
-ot "blk\.(3|4|5|6|7|8|9)\.ffn_.*=CUDA0" \
-ot "blk\.(10|11|12|13|14|15|16)\.ffn_.*=CUDA1" \
-ot exps=CPU \
--warmup-batch \
--threads 24
main: n_kv_max = 8704, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 99, n_threads = 24, n_threads_batch = 24
version: 3764 (9320993) (PR541)
version: 3761 (144ee1c) (main)
main: n_kv_max = 8704, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 99, n_threads = 24, n_threads_batch = 24
I'll try to get some numbers on the big 6980P pure-CPU soon!
Super quick, on the 6980P, using my numbers from yesterday for the comparison:
I have to juggle files to get that R1-0528-IQ3_KT onto the big rig, and will give more results when I find some time. tl;dr; Definitely looking better already! Great job!
Okay, back at a desk with my laptop for a little while. Here is a quick comparison for a mixed R1-0528-IQ3_KT quant.
Given not every tensor in this mix is a KT quant, I spot checked using fewer threads for TG and it was slower, so I'm sticking with the settings below. Finally, I didn't expect this, but it seems like PP increased a lot as well!!?? At the default batch size PP went from 36.75 up to 117.38, a ~3.19x speedup!!? I didn't track the code path to see if the new AVX-512 and other code is used for PP as well as TG. The effect is not as dramatic at higher batch sizes, but it still holds as being faster at ub 4096. No graphs tonight, but some data in the fold below shows the effect.
👈 llama-sweep-bench command and data
model=/mnt/raid/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ3_KT/DeepSeek-R1-0528-IQ3_KT-00001-of-00006.gguf
# adjust -ub 2048 -b 2048
# also adjust -c to be large enough for the batch size or it will segfault out
numactl -N 0 -m 0 \
./build/bin/llama-sweep-bench \
--model "$model" \
-c 1536 \
-ctk q8_0 \
-mla 3 -fa \
-fmoe \
--no-mmap \
--threads 128 \
--threads-batch 128 \
--numa numactl \
--warmup-batch
main@144ee1c4
main: n_kv_max = 1536, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = -1, n_threads = 128, n_threads_batch = 128
main: n_kv_max = 6144, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, n_gpu_layers = -1, n_threads = 128, n_threads_batch = 128
main: n_kv_max = 12288, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = -1, n_threads = 128, n_threads_batch = 128
PR541@93209939
main: n_kv_max = 1536, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = -1, n_threads = 128, n_threads_batch = 128
main: n_kv_max = 6144, n_batch = 2048, n_ubatch = 2048, flash_attn = 1, n_gpu_layers = -1, n_threads = 128, n_threads_batch = 128
main: n_kv_max = 12288, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = -1, n_threads = 128, n_threads_batch = 128
fwiw here is the output showing the CPU flags:
👈 6980P CPU flags
Thanks!
Confirmed for me for IQ3_KT, Llama 8B. Before patch: TG 3.27 t/s. Rig: Ryzen 5700G, AVX2, 4×8 GB DDR4-2666, on 8 threads, BBS 128, prompt 1024 tokens, then 100 tokens generated.
Thank you for testing!
This is not supposed to happen. It is a mixture of experts, so the new path can get invoked when an expert ends up processing fewer than (currently) 32 tokens. But at least on my end this works fine, even if I disable the repacking.
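(Illustration only, not the actual ik_llama.cpp code: a minimal sketch of the kind of dispatch described here, where an expert that ends up with fewer than 32 routed tokens takes the per-token GEMV path instead of the batched GEMM path over repacked data. The threshold constant, names, and naive reference kernels are all made up for the example.)

```cpp
#include <cstdint>

// Hypothetical threshold mirroring the "fewer than (currently) 32 tokens per
// expert" condition mentioned above; the real constant lives in the CPU back-end.
constexpr int kGemvTokenThreshold = 32;

// Naive float reference kernels standing in for the optimized KT kernels.
static void gemv_ref(const float *W, const float *x, float *y, int rows, int cols) {
    for (int r = 0; r < rows; ++r) {
        float sum = 0.0f;
        for (int c = 0; c < cols; ++c) sum += W[(int64_t)r * cols + c] * x[c];
        y[r] = sum;
    }
}

static void gemm_ref(const float *W, const float *X, float *Y, int rows, int cols, int n_tokens) {
    // In the real code this path works on repacked data; here it just loops.
    for (int t = 0; t < n_tokens; ++t)
        gemv_ref(W, X + (int64_t)t * cols, Y + (int64_t)t * rows, rows, cols);
}

// Dispatch sketch: small per-expert batches go through the GEMV path, which is
// why a GEMV optimization can also show up in MoE prompt processing.
void expert_matmul(const float *W, const float *X, float *Y, int rows, int cols, int n_tokens) {
    if (n_tokens < kGemvTokenThreshold) {
        for (int t = 0; t < n_tokens; ++t)
            gemv_ref(W, X + (int64_t)t * cols, Y + (int64_t)t * rows, rows, cols);
    } else {
        gemm_ref(W, X, Y, rows, cols, n_tokens);
    }
}
```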
Despite having just 16 vector registers it is still faster.
Compare 2bd491e to 5b677c3
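(For context on the register-count remark above: AVX2 exposes 16 ymm registers versus 32 zmm registers under AVX-512, which limits how many independent accumulators an unrolled kernel can keep live. A rough sketch of the multi-accumulator dot-product pattern, not the PR's actual kernel; compile with e.g. -O3 -mavx2 -mfma.)

```cpp
#include <immintrin.h>

// 4-way unrolled AVX2 dot product: four live accumulators plus the loaded
// operands already use a good chunk of the 16 available ymm registers,
// whereas AVX-512 kernels can afford wider unrolling.
float dot_avx2(const float *a, const float *b, int n) {
    __m256 acc0 = _mm256_setzero_ps(), acc1 = _mm256_setzero_ps();
    __m256 acc2 = _mm256_setzero_ps(), acc3 = _mm256_setzero_ps();
    int i = 0;
    for (; i + 32 <= n; i += 32) {
        acc0 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i +  0), _mm256_loadu_ps(b + i +  0), acc0);
        acc1 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i +  8), _mm256_loadu_ps(b + i +  8), acc1);
        acc2 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 16), _mm256_loadu_ps(b + i + 16), acc2);
        acc3 = _mm256_fmadd_ps(_mm256_loadu_ps(a + i + 24), _mm256_loadu_ps(b + i + 24), acc3);
    }
    __m256 acc = _mm256_add_ps(_mm256_add_ps(acc0, acc1), _mm256_add_ps(acc2, acc3));
    float tmp[8];
    _mm256_storeu_ps(tmp, acc);
    float sum = tmp[0] + tmp[1] + tmp[2] + tmp[3] + tmp[4] + tmp[5] + tmp[6] + tmp[7];
    for (; i < n; ++i) sum += a[i] * b[i];  // scalar tail
    return sum;
}
```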
tl;dr;
Okay, just tried out the latest commit. Looks like PP is stable compared to main now, while TG is 1.6x faster running CPU-only on the Intel Xeon 6980P! I'm also running some perplexity comparisons between the CUDA and CPU implementations on the 24-core Thread Ripper Pro and will check in later when that is done.
Details
This time I made a set of 3 "pure" Qwen3-14B test quants (IQ2_KT, IQ3_KT, IQ4_KT).
Quant Collection
All "pure" except token_embd, which is q4_K, and the final output, which is q6_K.
sweep-bench
👈 sweep-bench command and data
numactl -N 0 -m 0 \
./build/bin/llama-sweep-bench \
--model "$model" \
--ctx-size 8704 \
-ctk q8_0 -ctv q8_0 \
-fa \
--no-mmap \
--warmup-batch \
--threads 128 \
--threads-batch 128 \
--numa numactl
IQ4_KT PR541@5b677c3c
IQ3_KT PR541@5b677c3c
IQ2_KT PR541@5b677c3c
IQ4_KT main@1843ed22
IQ3_KT main@1843ed22
IQ2_KT main@1843ed22
Okay, here are the perplexities as run on the Thread Ripper Pro. I ran all the Qwen3-14B quants on a single RTX A6000 to use the CUDA implementation, and then the three KT quants again CPU-only.
So it looks like the CPU implementation is within the margin of error, though it shows a very slight increase in perplexity over the CUDA implementation.
👈 Perplexity command and data including error values
# For CPU remove `-ngl` and increase threads
CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-perplexity \
--model "$model" \
-fa \
-f wiki.test.raw \
--seed 1337 \
-ngl 99 \
--threads 1
CUDA
CPU
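(One simple way to read "within the margin of error", sketched below with placeholder values rather than the run data above: compare the PPL difference against the combined reported uncertainties.)

```cpp
#include <cmath>
#include <cstdio>

// Sketch: treat the two PPL estimates as independent and call them consistent
// when their difference falls within the combined one-sigma error.
// The numbers below are placeholders, not measured values from this thread.
int main() {
    const double ppl_cuda = 9.000, err_cuda = 0.070;  // hypothetical
    const double ppl_cpu  = 9.010, err_cpu  = 0.070;  // hypothetical
    const double diff     = std::fabs(ppl_cpu - ppl_cuda);
    const double combined = std::sqrt(err_cuda * err_cuda + err_cpu * err_cpu);
    std::printf("diff = %.4f, combined error = %.4f -> %s\n", diff, combined,
                diff < combined ? "within margin of error" : "significant");
    return 0;
}
```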
Conclusion
Overall the PR looks like a great speed improvement for token generation of KT quants. Given they still seem CPU bottle-necked, at least in this specific test, I'd likely choose the 4bpw version over the smaller sizes when targeting tensors destined for CPU/RAM, because it generates about as fast while keeping more quality. Makes me wonder at what point a 5bpw or 6bpw version would become RAM-bandwidth bottle-necked again, but that's probably heavily dependent on the specific model and hardware. An iq6_kt probably still would not hit that RAM/CPU bottleneck cross-over point on the ~512 GB/s 6980P... 512 / (27.509 * (6/16)) = ~50 tok/sec theoretical max. To be fair, that rig is not hitting the theoretical max on simpler quants either, possibly NUMA related but I'm not really sure. Anyway, very cool stuff! Thanks!
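(A tiny sketch spelling out that back-of-the-envelope estimate with the same numbers as above; the units cancel as long as bandwidth and model size use the same base, and the 6 bpw "iq6_kt" is hypothetical as noted in the comment.)

```cpp
#include <cstdio>

// If TG were purely memory-bandwidth bound, the ceiling is roughly
// bandwidth / bytes-read-per-token, approximating bytes-per-token by the
// quantized model size (16-bit size scaled by bits-per-weight / 16).
int main() {
    const double bandwidth      = 512.0;   // ~per-socket bandwidth, as in the comment above
    const double model_size_16b = 27.509;  // 16-bit model size, same units as the bandwidth numerator
    const double bpw            = 6.0;     // hypothetical iq6_kt
    const double quant_size     = model_size_16b * bpw / 16.0;
    std::printf("theoretical TG ceiling ~ %.1f tok/s\n", bandwidth / quant_size);  // ~49.6
    return 0;
}
```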
I was too curious and had to see how it performed on the AMD Thread Ripper Pro. Interestingly, there was more variability in the generation speed than with the Xeon 6980P. So I take back my conclusion above about always reaching for the 4bpw... lol... Here are the graph and numbers below. Cheers!
👈 sweep-bench command and data
./build/bin/llama-sweep-bench \
--model "$model" \
--ctx-size 8704 \
-ctk q8_0 -ctv q8_0 \
-fa \
--no-mmap \
--warmup-batch \
--threads 24
IQ4_KT PR541@5b677c3c
IQ3_KT PR541@5b677c3c
IQ2_KT PR541@5b677c3c
IQ4_KT main@1843ed22
IQ3_KT main@1843ed22
IQ2_KT main@1843ed22
I'm happy enough with the performance now to do the release. Over and out!
@ubergarm Thank you for the extensive testing! Based on the tests, this looks like a winner, so merging.
The PR adds some optimizations to the GEMV implementation of the IQ2_KT, IQ3_KT, IQ4_KT quants.
On my Ryzen-7950X I don't notice much of a difference when running with 16 threads, as the calculation is (nearly) memory bound. But when testing with fewer threads, I see quite significant gains in TG performance compared to the main branch. Here are some results for Llama-3.1-8B-Instruct:
IQ2_KT
IQ3_KT
IQ4_KT
@ubergarm In your performance testing on the 6980P system the iqX_kt quants were very far from saturating memory bandwidth, so perhaps you will see bigger gains there than I see on my system when using all cores.
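(To make the memory-bound vs. compute-bound point concrete, here is a rough sketch of a dequantize-on-the-fly GEMV row loop in the spirit of the kernels being discussed. It is not the PR's AVX-512 code: the trellis decode is replaced by a toy stand-in, and the names and layout are illustrative.)

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy stand-in for the trellis decode: the real IQx_KT kernels reconstruct a
// block of weights from packed trellis indices (a few bits per weight) with a
// short sequence of integer and float ops. This dummy just widens one byte per weight.
static void decode_block_kt(const uint8_t *packed, float *out, int block_size) {
    for (int j = 0; j < block_size; ++j) out[j] = (float)packed[j] - 128.0f;
}

// Dequantize-on-the-fly GEMV row loop: decode one block of weights at a time
// and accumulate its dot product with x. With all cores busy this is limited by
// how fast the packed weights stream from RAM; with few threads the decode
// arithmetic dominates, which is where a faster GEMV shows up as TG gains.
void gemv_kt_sketch(const uint8_t *packed, const float *x, float *y,
                    int rows, int cols, int block_size) {
    std::vector<float> w(block_size);
    for (int r = 0; r < rows; ++r) {
        float sum = 0.0f;
        const uint8_t *row = packed + (size_t)r * cols;  // 1 byte/weight in this toy layout
        for (int c = 0; c < cols; c += block_size) {
            decode_block_kt(row + c, w.data(), block_size);
            for (int j = 0; j < block_size; ++j) sum += w[j] * x[c + j];
        }
        y[r] = sum;
    }
}
```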