Replies: 3 comments 1 reply
-
So, things are changing fast in this repository. Yes, it used to be true that row-interleaved quants offered better PP performance. But then I optimized non-interleaved quants in PRs #531, #533, #534 (AVX2) and #549, #550, #552 (ARM_NEON), so now non-interleaved quants have better PP performance. The better TG performance is unexpected. I haven't checked these models closely, but I wouldn't be surprised if the
Please file an issue with your command. If you could run in the debugger and do a backtrace when it crashes, that would be great!
-
Yup, as ik says, things are moving fast. In light of the recent optimizations for non-interleaved quants, I have moved away from releasing the row-interleaved (R4) quants. Some folks have been re-mixing my recipes using the newer quantization types available in non-interleaved form, with good results, as described here: #616 (comment)
-
I have a dual-Xeon system with GPUs, and in my experience `-rtr` or statically packed quants only helped prompt processing at certain batch sizes; otherwise they would lower speeds. In my case TG would improve a bit, so I'm surprised yours did not.
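For anyone wanting to reproduce this kind of comparison on their own hardware, a rough sketch (assuming ik_llama.cpp's `llama-bench` tool and its `-rtr` run-time-repacking option; the model path, batch sizes, and exact flag spellings are placeholders to verify against your build's `--help`):

```shell
# Hypothetical sketch: measure PP/TG with and without run-time repacking
# at a couple of prompt/batch sizes, then compare the reported t/s.
./llama-bench -m model.gguf -p 512,2048 -n 128          # baseline
./llama-bench -m model.gguf -p 512,2048 -n 128 -rtr 1   # with repacking
```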
-
Comparing @ubergarm's R4 quants to Unsloth's UD quants, I get significantly worse performance in ik_llama.cpp with the R4 quants:
These aren't exactly the same quants, but the difference seems significant.
Am I doing something wrong? And am I making the best use of these ik_llama-tuned R4 quants?
(Also, I usually get around 3.5 TG t/s on the UD R1 quant, but the PP and TG vary between reboots and cache clears, for some reason.)
Details
Run command for the above table:
(I know that `-mla 3` is recommended for best performance, but I get segfaults on the UD quant with anything other than `-mla 0`.)

My hardware is an old Dell PowerEdge R740, with:
ik_llama.cpp config: build `3736`, commit `1eabdb42`
This may or may not be relevant: this is all running on an `Ubuntu 24.04.2 Server` VM hosted on `Proxmox Virtual Environment 8.4.0` on the Dell R740. I've configured the VM to use the host CPU, with 1:1 mappings to cores, NUMA enabled on both host and guest, and equal memory bindings across the two NUMA nodes. Even if this were a performance impediment, I'd still expect ubergarm's R4 quants to perform better relative to UD in ik_llama.cpp, right?