I've been running unsloth's IQ4_XS quant of Qwen 235B Instruct with upstream llama.cpp, as that is the largest quant I can fit on my 128GB Mac Studio with 32k context. I just read about ik_llama.cpp on Reddit and saw that someone uploaded an IQ4_KSS version, which apparently uses less memory. I'm curious how this new IQ4_KSS compares with IQ4_XS. Should I download the new quant, or are the two going to be too close to notice any difference?
There really isn't a good way to know this without trying it yourself, since every use case (and user) differs in which quality differences actually matter.

That said, there is a lot of info about this in the IQ4_KSS PR here: #89. I would also recommend reading #83, as that PR compares an equivalently sized newer quant against IQ4_XS. Another user asked a similar question here: #334 (reply in thread), and that is also worth reading.
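If you want a rough quantitative comparison before committing to a multi-gigabyte download, one common approach is to run the perplexity tool over the same corpus with each quant and compare the numbers. A minimal sketch, assuming placeholder model filenames and the stock `wiki.test.raw` corpus (the binary may be named `perplexity` in older builds):

```sh
# Hypothetical paths; substitute your own GGUF files and test corpus.
# Lower perplexity at a similar file size generally indicates a better quant.
./llama-perplexity -m qwen-235b-iq4_xs.gguf  -f wiki.test.raw -c 512
./llama-perplexity -m qwen-235b-iq4_kss.gguf -f wiki.test.raw -c 512
```

Keep in mind perplexity is only a proxy; the PRs linked above discuss where the two quant types actually diverge.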
Thanks @saood06, I will give it a shot, as IQ4_KSS allows for extra context on my setup. Unrelated to the original question, but do ik_llama.cpp's CPU inference improvements introduce some sort of regression for Apple Silicon Metal inference? I built ik_llama.cpp locally and ran llama-bench on my existing IQ4_XS weights, and it seems to get half the pp512 speed.

This is on a Mac Studio M1 Ultra with 125GB VRAM. I noticed that the ik version doesn't have "BLAS" in the backend column.
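For reference, this is the kind of side-by-side llama-bench run that surfaces the pp512 gap; paths and the model filename below are placeholders, not my exact command:

```sh
# Same IQ4_XS weights, two builds; compare the pp512 (prompt processing)
# and tg128 (token generation) rows across the two outputs.
./llama.cpp/build/bin/llama-bench    -m qwen-235b-iq4_xs.gguf -p 512 -n 128
./ik_llama.cpp/build/bin/llama-bench -m qwen-235b-iq4_xs.gguf -p 512 -n 128
```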