-
I played around with offline repacking next. Oh boy. Offline repacking on a 4096 batch:
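For anyone following along, "offline repacking" means the tensors are already stored in the interleaved _R4/_R8 layout inside the GGUF, instead of being converted at load time by -rtr. Roughly like this (binary names, paths and the _R4 type name are a sketch of an ik_llama.cpp setup, not gospel; check llama-quantize --help on your build):

```
# run-time repacking: tensors get converted to the interleaved layout at load time
./llama-server -m model-Q4_K_M.gguf -rtr --no-mmap   # plus the usual offload/thread flags

# offline: produce a GGUF whose tensors are already stored as an _R4 type,
# so nothing needs converting at load and mmap can stay on
./llama-quantize model-f16.gguf model-Q4_K_R4.gguf Q4_K_R4 48
# (IIRC ik_llama.cpp's llama-quantize also has a --repack option for
#  already-quantized files -- check --help)
```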
It seems like performance here is identical to using -rtr. The debuff to text generation is likely from mmap. Ok.. so let's try it in a configuration where repacking previously helped, like the last one in the previous post. Only 6 layers are incorrectly packed, and everything has gone into the toilet.
Then I indiscriminately repacked the whole model to see what would happen. It got just as bad. Lots of transfers. Could it be related to the offload policy? I didn't even bother waiting for the first iteration, it took so long; the CPU was running at about 10 cores judging from the 1000% usage. And finally I packed the model correctly AND used the configuration that produced a speed gain.

With mmap:

No mmap:
Does it help to cache the model first? Let's run with mmap again....
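(By "cache the model first" I just mean getting the file into the Linux page cache before the mmap'd run, e.g.:)

```
# warm the page cache so the mmap'd weights are read from RAM, not disk
cat /path/to/model.gguf > /dev/null

# or go the other way and drop caches for a cold-start measurement (needs root)
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
```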
NOPE! So the point of the whole story, if anyone cares, is that even a few mis-packed layers will tank your speeds. It feels like there is no point in posting R4/R8 quants, because the user will have to repack them anyway unless they use the EXACT configuration of the author. What am I missing here?

As a bonus.. let's find where RTR starts to help prompt processing. First I'll take a new baseline, because it seems textgen is not working so well after all the packing/loading/etc. Could be I need to drop caches?

4096, no rtr, no-mmap baseline:
That's the highest we will get for now.

2048, no RTR, no-mmap:
2048, with rtr:
So still a debuff to prompt processing and a mild gain to t/g. Let's try something else....

2048/1024, -rtr:
2048/1024, no rtr and no-mmap:
Ok.. now prompt processing finally fell.. the originally observed effect. So then -rtr or repacking is only useful when ub is half the batch size? At least it does let you generate text a little bit faster in every test.
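If anyone wants to map the crossover point more systematically than my one-off runs, a sweep along these lines should do it (paths and thread count are illustrative, and the -rtr 0,1 column assumes ik_llama.cpp's llama-bench):

```
# sweep batch/ubatch combinations with and without run-time repacking;
# add whatever -ngl / -ot offload split you normally run with
./llama-bench -m /path/to/model.gguf \
    -p 4096 -n 128 \
    -b 2048,4096 -ub 1024,2048,4096 \
    -rtr 0,1 -mmp 0 -t 48
```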
-
Perhaps to understand how repacked quants behave on the CPU and CUDA, it is easier to take a smaller model that would completely fit on one GPU, quantize it with both a regular and a repacked variant of the same quantization type, and compare.
It is an easy exercise: it does not require an imatrix, as you are not after the best possible quantization quality, and if you pick a model that is not too large, it is very quick to do. Without having understood what the repacking does or does not do for you, it becomes very hard to sort out the big models with partial offloads, offload policy, numa, what runs on the GPU or CPU when and why, etc.
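Concretely, something like this is enough (the _R4 type name assumes the repacked variants show up in your llama-quantize listing; model and paths are just placeholders):

```
# plain vs. repacked quant of the same small model, no imatrix needed
./llama-quantize small-model-f16.gguf small-model-Q4_K_S.gguf  Q4_K_S  16
./llama-quantize small-model-f16.gguf small-model-Q4_K_R4.gguf Q4_K_R4 16

# compare pure CPU (-ngl 0) against full offload (-ngl 99) for each
./llama-bench -m small-model-Q4_K_S.gguf  -p 512 -n 128 -ngl 0,99
./llama-bench -m small-model-Q4_K_R4.gguf -p 512 -n 128 -ngl 0,99
```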
-
Finally got around to testing a smaller model, and a non-IQ quant as well: DeepSeek-V2-Lite-Chat.i1-Q4_K_M.
No RTR, 48c CPU distribute, cache on GPU:
RTR, 48c CPU distribute, cache on GPU (iqk_repack_tensor(output.weight): q6_K -> q6_k_r4. 102400 rows, 3200 chunks, 48 threads):
24 cores, numa isolate + RTR + no interleave:
24 cores, no interleave + no rtr + numa isolate:
Fully on GPU:
No GPU, full cores, no rtr:
No GPU, full cores, RTR:
It looks like on this system, RTR only helps when there is no GPU involved or when the ubatch is 1024 (previous tests). In every other case, RTR lowers prompt processing by a lot but improves TG.
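For reference, the NUMA variants above map onto flags roughly like this (binary name, paths and core counts are illustrative, and the "cache on GPU" part of the setup is left out):

```
# all 48 cores, memory spread across NUMA nodes
./llama-server -m DeepSeek-V2-Lite-Chat.i1-Q4_K_M.gguf -t 48 --numa distribute

# pin to the 24 cores of one node and keep allocations local to it
numactl --cpunodebind=0 --membind=0 \
    ./llama-server -m DeepSeek-V2-Lite-Chat.i1-Q4_K_M.gguf -t 24 --numa isolate
```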
-
I had long assumed that -rtr was a universal speedup and that, just like repacking, it would always help performance. Seems that is not the case.
Qwen 235b command line:
Buffers without RTR:
Buffers with RTR:
It's even worse on DeepSeek, where my prompt speeds were cut in half while only losing about 1.5 t/s of TG. Another thing of note is that not repacking causes many more large transfers to the GPU. I saw rates of up to 16 GB/s going between the cards and, I assume, the system?
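For anyone who wants to watch those transfers themselves, nvidia-smi can report per-GPU PCIe throughput:

```
# rxpci/txpci columns show PCIe receive/transmit throughput per GPU in MB/s
nvidia-smi dmon -s t
```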
Peculiar thing though, for smaller batches:
235b, ub 1024:
Without -rtr, this makes ~120 t/s prompt processing at most. Does anyone know why, or has anyone noticed something similar?