Much faster CPU prompt processing (part 3) #534
Conversation
* Repack q4_0 and q8_0 to q8_0_R8

  q8_0 is fine, but I observe a very significant PPL increase for q4_0. Best guess: precision loss with the 32 bit <-> 16 bit scale conversions.

* Change q8_2_x4 to store int16_t sums

  With that q4_0 now works. I need to check all quants that use q8_2_x4!

* q5_0 and use a dequantizing template
* q6_0: 129 t/s -> 296 t/s. q6_0_r4 is at 244 t/s.
* iq4_nl: 137 t/s -> 293 t/s. iq4_nl_r4 is at 251 t/s.
* q4_1: 135 t/s -> 262 t/s
* q5_1: 125 t/s -> 253 t/s
* iq4_xs: 178 t/s -> 363 t/s. iq4_xs_r4 is at 275 t/s.
* q2_K: 202 t/s -> 364 t/s. q2_k_r4 is at 247 t/s.

---------

Co-authored-by: Iwan Kawrakow <[email protected]>
```c++
float d = _mm_cvtss_f32(max4/127.f);
```
This line (2077) in iqk_gemm_kquants.cpp triggers the following error in MSVC (Visual Studio 2022, Windows 11):

```
binary '/': '__m128' does not define this operator or a conversion to a type acceptable to the predefined operator
```

I compile with AVX2 and FMA enabled.
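For anyone hitting the same thing before updating: dividing a `__m128` by a scalar with `/` relies on a GCC/Clang vector extension that MSVC does not implement, so the expression has to be spelled out with intrinsics (or the scalar extracted first). A minimal sketch of the portable idiom, not necessarily the exact change that landed:

```c++
#include <immintrin.h>

// max4 holds the block maximum in lane 0 (as in the snippet above).
// `max4 / 127.f` compiles on GCC/Clang via their vector extensions, but MSVC
// has no operator/ for __m128, so the division must be made explicit:
static float block_scale(__m128 max4) {
    // Option 1: do the division in the vector domain with an SSE intrinsic.
    float d = _mm_cvtss_f32(_mm_div_ss(max4, _mm_set1_ps(127.f)));
    // Option 2 (equivalent): extract the scalar first, divide in plain C++.
    // float d = _mm_cvtss_f32(max4) / 127.f;
    return d;
}
```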
Should be fixed now.
@ikawrakow: It is, thank you!
This 3-part refresh of PP performance across so many quants is epic; I appreciate your explaining the details in your PR notes.
Great to see this one in there too. I ran into it yesterday playing with moonshotai/Kimi-Dev-72B, which is a fine-tune of the Qwen-2.5-72B architecture. Turns out for those models the FFN intermediate size is 29568, which is divisible by 128 but not by 256, so the 256-block quant types can't be used for the affected tensors.

I saw some notes on vLLM about padding the 29568 intermediate size by 128 before quantization, and I believe turboderp's exllamav3 does something similar. Are there any quantization/padding options here to deal with this?

I'll need to re-run some llama-sweep-bench testing, but I made a shotgun collection of experimental quants of this dense 72B hoping to find a good mix for 16-24GB VRAM hybrid inferencing. While the prompt processing speeds are excellent (especially given probably less than 32k context), the token generation speeds seem bottlenecked by RAM i/o. The solution there is to use a smaller quant to fit more layers on GPU, but that directly eats into the perplexity score. I'm still feeling around for that "knee" point in the curve to get a fair trade-off between TG and perplexity.

No wonder many folks are choosing MoEs for hybrid inference over dense 72Bs: a MoE's fewer active weights during TG yield faster speeds at a larger overall parameter count.
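To spell out the divisibility arithmetic behind that padding trick (k-quant super-blocks hold 256 weights, the fallback "type-0/1" blocks hold 32):

$$
29568 = 231 \times 128 = 115.5 \times 256, \qquad 29568 + 128 = 29696 = 116 \times 256
$$

so the unpadded row size only fits the 32-weight block types, while the padded one is a clean multiple of 256.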
TG performance of MoE models is far away from what is theoretically possible. If I look at your 6980P system, IIRC it has in the range of 512 GB/s memory bandwidth per node. So, running DeepSeek on a single node (because we haven't learnt how to do the NUMA thing effectively) and getting 10 t/s for 20 GB worth of active parameters means we are a factor of 2.5X away from what should be achievable. I do fully saturate the memory bandwidth of my systems with the dense models I can run, so I was hoping that one can get that with a 70B dense model as well (on a higher bandwidth system). If so, quantized at 4 bpw, one should be getting in the range of 15 t/s TG on your rig for this 70B dense model running CPU only.
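Making the arithmetic behind those two numbers explicit (bandwidth-bound TG is roughly bandwidth divided by the bytes of active weights read per token):

$$
\frac{512\ \text{GB/s}}{20\ \text{GB}} \approx 25.6\ \text{t/s}, \qquad \frac{25.6}{10} \approx 2.5\times
$$

and for a 70B dense model at 4 bpw, $70 \times 10^9 \times 0.5\ \text{bytes} = 35\ \text{GB}$, so $512 / 35 \approx 15\ \text{t/s}$.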
If I were the Emperor of the Universe, I would put people creating models with strange tensor dimensions in prison. They haven't heard that modern computing architectures strongly prefer to operate on data sizes that are divisible by a high power of 2? And I mean, do they really believe that it makes a difference if the FFN tensors were 29440 or 29696 instead of 29568? Hahaha.
Padding was discussed back in the day, but the idea was discarded. After all, it is …
I do think now that we have `-ot`, if the GGUF were changed to split up the experts and you launched it with …
Always appreciate your insights, and these new prompt processing numbers are looking great on AVX2 CPUs!
I ran llama-sweep-bench across the new quants (commands, data, and model descriptions in the details below). My impression is that the big 6980P CPU is not saturating the expected ~512 GB/s per-socket RAM bandwidth during generation. As you mentioned, it could theoretically hit ~15 tok/sec (512 GB/s bandwidth / 32 GB model size = 16 tok/sec). I spot checked using 80 and 64 threads for TG on the Intel Xeon 6980P, but fewer threads led to slower generation for this benchmark. Perhaps that is because its 3x CCDs are configured as a single NUMA node via BIOS config.

While the 24-core 7965WX Thread Ripper Pro is doing better, it has 4x CCDs configured as a single NUMA node via NPS1, which could possibly be causing a hit to TG performance.

Assuming the benchmarked ~512 GB/s RAM bandwidth on the 6980P and, let's call it, ~256 GB/s on the Thread Ripper Pro are accurate, the potential token generation breakdown looks like this:
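Roughly, using just the bandwidth figures and the ~32 GB model size quoted above (plus the ~87 GB/s measured on the 9950X further down), the bandwidth-bound ceilings work out to:

| System | Est. bandwidth | Model size | Theoretical max TG |
| --- | --- | --- | --- |
| 6980P | ~512 GB/s | ~32 GB | ~16 t/s |
| 7965WX Thread Ripper Pro | ~256 GB/s | ~32 GB | ~8 t/s |
| 9950X | ~87 GB/s | ~32 GB | ~2.7 t/s |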
I want to like the ~70B dense models, but man they are difficult to get good TG from without offloading the whole thing to VRAM... I could try my home AMD 9950X given it would fit; even with lower absolute TG speeds it could be more "efficient" given the native single NUMA node...

EDIT: I ran one on my home 9950X, benching ~87 GB/s (with overclocked Infinity Fabric at "gear 1" ratios), and updated the graph and table above.

👈 Commands, Data, Model Descriptions

* Q4_0: extra pure
* smol-IQ3_K (it's called …)
* IQ3_KT: using the most recent PR merged into main
```bash
# on the Thread Ripper Pro I removed numactl stuff and used 24 threads.
numactl -N 0 -m 0 \
./build/bin/llama-sweep-bench \
--model "$model" \
--ctx-size 6144 \
-ctk q8_0 -ctv q8_0 \
-fa \
--no-mmap \
-ub 2048 -b 2048 \
--warmup-batch \
--threads 128 \
--threads-batch 128 \
--numa numactl
```

6980P Q4_0 -t 128
6980P smol-IQ3_K -t 128
6980P IQ3_KT -t 128
7965WX Q4_0 -t 24
7965WX smol-IQ3_K -t 24
7965WX IQ3_KT -t 24
9950X smol-IQ3_K -t 16
9950X smol-IQ3_K -t 16 -ngl 48 (NOT GRAPHED, JUST FOR FUNZIES)
I've uploaded the smol-IQ3_K to huggingface here.
I was checking how bullerwins dealt with the goofy ffn_down dimensions. Given they use …

I didn't look into it further, and used …
Right, related to the … The PP performance on the …

Another similar benchmark as above, but now for the DeepSeek-R1-0528 MoE. I ran it here offloading the same number of layers to GPUs so as not to OOM RAM. This is just the Thread Ripper Pro, 24 cores, default batch sizes:

* IQ3_KS_R4: 300.938 GiB (3.847 BPW)
* IQ3_KT: 272.527 GiB (3.483 BPW)
👈 llama-sweep-bench details and data

Ignore the PP numbers, given this was run at low batch sizes, so it is not a good comparison.

```bash
#model=/mnt/raid/hf/DeepSeek-R1-0528-GGUF/IQ3_K_R4/DeepSeek-R1-0528-IQ3_K_R4-00001-of-00007.gguf
model=/mnt/raid/hf/DeepSeek-R1-0528-GGUF/IQ3_KT/DeepSeek-R1-0528-IQ3_KT-00001-of-00006.gguf

./build/bin/llama-sweep-bench \
--model "$model" \
--no-mmap \
--ctx-size 8704 \
-ctk f16 \
-mla 3 -fa \
-fmoe \
-amb 512 \
-ngl 99 \
-ot "blk\.(3|4|5|6|7|8|9)\.ffn_.*=CUDA0" \
-ot "blk\.(10|11|12|13|14|15|16)\.ffn_.*=CUDA1" \
-ot exps=CPU \
--warmup-batch \
--threads 24
```

IQ3_KS_R4
IQ3_KT
So, given that DeepSeek-R1-671B has 37B parameters active during generation and the ~256 GB/s theoretical max bandwidth of the Thread Ripper Pro, we can calculate the GiB of active parameters and a theoretical max TG as above, but we need to account for the GPU offload of the 1 shared expert, 3 dense layers, and first 16 routed-experts layers, leaving ~30B active parameters on CPU/RAM.

Then, assuming any of this is close, the "Yield" is fairly close to that of the dense model above.
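Spelling that estimate out for the IQ3_KT mix (3.483 BPW) under the ~30B-active-on-CPU assumption above:

$$
\frac{30 \times 10^9 \times 3.483\ \text{bits}}{8} \approx 13\ \text{GB}, \qquad \frac{256\ \text{GB/s}}{13\ \text{GB}} \approx 20\ \text{t/s theoretical max TG}
$$

with the IQ3_KS_R4 mix (3.847 BPW) coming out slightly lower, at roughly 256 / 14.4 ≈ 18 t/s.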
Thanks again for these great PP speed-ups and your time and patience with my long-ass posts haha.. I gotta eat some dinner now, cheers!
Yes, the …
This PR is a follow up of #531 and #533, and adds much faster GEMM for the remaining non-interleaved quants: `Q2_K, IQ4_XS, IQ4_NL, Q4_0, Q4_1, Q5_0, Q5_1, Q6_0, Q8_0`.

Here is a PP-512 performance comparison between the main branch and this PR for LLaMA-3.1-8B-Instruct on a Ryzen-7950X CPU:

We observe gains in the range of 2X for all types. In case anyone is wondering why we see 3 performance levels, this is simply due to the quantization type to which the data gets repacked:

* `Q2_K` and `IQ4_XS` get repacked to `Q8_K_R8`, and hence have a higher performance due to the faster `Q8_K_R8 x Q8_K` GEMM
* `IQ4_NL, Q4_0, Q5_0, Q6_0, Q8_0` get repacked to `Q8_0_R8`, so the `Q8_0_R8 x Q8_2_X4` GEMM gets used, and they all end up with PP-512 in the 290-300 t/s range
* `Q4_1` and `Q5_1` get repacked to `Q8_1_R8` (they must, due to being "type-1" quants), and that results in the lower performance of around 250 t/s
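To illustrate the repacking idea (a minimal sketch only; the block layouts below are simplified stand-ins, and the actual `Q8_0_R8` format and the AVX2 GEMM kernels in this repo interleave the data differently): eight consecutive rows are repacked so that the same block index of all eight rows sits in one contiguous chunk, which is what lets an 8-row GEMM microkernel stream through memory linearly.

```c++
#include <cstdint>
#include <cstring>
#include <vector>

// Simplified stand-in for a 32-weight Q8_0 block: one scale plus 32 int8 quants.
// (The real type stores the scale as fp16; fp32 keeps the sketch short.)
struct BlockQ8_0 {
    float  d;
    int8_t qs[32];
};

// Interleaved block covering the same 32-weight slice of 8 consecutive rows.
struct BlockQ8_0_R8 {
    float  d[8];        // one scale per row
    int8_t qs[8 * 32];  // quants of all 8 rows for this block, back to back
};

// Repack 8 rows of n_blocks blocks each into n_blocks interleaved blocks, so an
// 8-row GEMM kernel can fetch everything it needs for one step with one
// contiguous load instead of gathering from 8 distant row pointers.
static void repack_8_rows(const BlockQ8_0* rows[8], int n_blocks,
                          std::vector<BlockQ8_0_R8>& out) {
    out.resize(n_blocks);
    for (int ib = 0; ib < n_blocks; ++ib) {
        for (int r = 0; r < 8; ++r) {
            out[ib].d[r] = rows[r][ib].d;
            std::memcpy(out[ib].qs + 32 * r, rows[r][ib].qs, sizeof(rows[r][ib].qs));
        }
    }
}
```

The same idea explains the three performance tiers above: what matters for PP is not the original quant type but the type it gets repacked to, since that determines which GEMM kernel runs.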