Replies: 8 comments 24 replies
-
Just to be clear: I wasn't implying you had done anything wrong; I was merely showing something that I had noticed and spent a couple of hours playing with last year (which I never mentioned before as it wasn't clear it was of any use, nor related to anything useful). I'm sorry if I've come across badly, as that isn't my intention - I've nothing to gain from any of this, but just find it interesting :) If you search my nick you can find similar posts by me on similar topics on the now-dead 2+2 forums (everything is on Discord now, sadly) from 25+ years ago!
-
@jukofyork Sorry if I have come across a bit harsh. But it is interesting stuff indeed, so we can all get passionate about it. Anyway, attached is a very simple C++ program that illustrates the asymmetry of the scaled distribution, together with a plot of the computed average as a function of sample size that it was used to generate (a negative sample size argument will cause the program to loop between 2 and the absolute value of the argument given).
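As a rough sketch of the kind of program being described (my reconstruction, not the original attachment; the pin-to-−1 convention and the block count are assumptions), something like the following prints the average of max-normalized Gaussian samples as a function of sample size:

```cpp
// Sketch, not the original attachment: draw blocks of `n` Gaussian values,
// rescale each block so its maximum-magnitude value maps to -1 (mimicking a
// block scale tied to the largest weight), and print the average of the
// scaled values. For a symmetric distribution this average would be 0 in the
// limit of infinite block size; for finite n it comes out visibly negative.
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <random>
#include <vector>

static double scaled_average(int n, int nblocks, std::mt19937 & rng) {
    std::normal_distribution<double> gauss(0.0, 1.0);
    std::vector<double> x(n);
    double sum = 0;
    for (int b = 0; b < nblocks; ++b) {
        int imax = 0;
        for (int i = 0; i < n; ++i) {
            x[i] = gauss(rng);
            if (std::fabs(x[i]) > std::fabs(x[imax])) imax = i;
        }
        const double scale = -1.0/x[imax]; // max-magnitude value -> -1
        for (int i = 0; i < n; ++i) sum += scale*x[i];
    }
    return sum/((double)n*nblocks);
}

int main(int argc, char ** argv) {
    int arg = argc > 1 ? std::atoi(argv[1]) : 32;
    if (arg == 0) arg = 32;
    // a negative argument loops over sample sizes 2 .. |arg|
    const int first = arg < 0 ? 2 : arg;
    const int last  = std::abs(arg);
    std::mt19937 rng(1234);
    for (int n = first; n <= last; ++n) {
        std::printf("%d %g\n", n, scaled_average(n, 100000, rng));
    }
    return 0;
}
```

With this pin-to-−1 convention the expected average works out to −1/n, i.e. the asymmetry shrinks with block size but never quite disappears for finite blocks.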
-
Thank you. Your existing work on […]

Right, I will consider reverting to the existing quantization methods when no imatrix is provided. I was hoping the more exhaustive algorithms would always be better (since they are better at minimizing the weighted squared error), but not when they optimize the wrong thing (i.e. when no imatrix is used). But I also suspect the default weights for the weighted rounding without an imatrix could be improved. Aside: is there a generally better solution for the default importance weights (without an imatrix)?

Strange, the increase in quantization time for `IQ4_NL` […]. And there are still some adjustments I did not try yet which could improve both the time (by a noticeable factor) and the perplexity (hopefully), which is to add the same "clamping protection" as my linear weighted rounding algorithms (e.g. in […]). I value your feedback, which is why I'll try to improve on this point (or exclude the changes to […]).
I do use `Q8_0` for the token embeddings when using `--pure`; this is the command I use:

```
$ ./bin/llama-quantize --imatrix <some-file.imatrix> --token-embedding-type q8_0 --output-tensor-type q8_0 --pure <source.gguf> <quant.gguf> <quant-type>
```
Yeah, I did notice that. The search algorithms I've made can be adapted to other metrics (although that can also be said of the existing algorithms for k-quants, since they also use weighted squared error), as long as they can be calculated cumulatively. I'd like to find better surrogates, and more exhaustive search algorithms which are not brute-force (yet still yield optimal-looking results) can help with that, even though for now minimizing weighted squared error on the model tensors doesn't quite match the actual thing we want to minimize (PPL and KLD), which makes your carefully tuned heuristics superior for now.
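To make concrete what "calculated cumulatively" can mean here (a sketch of the general idea, not code from the PR): for a fixed choice of quantized values q_i, the best block scale and the resulting weighted squared error depend only on a few running sums, which can be updated in O(1) when a single q_i changes during a search.

```cpp
// Sketch (general idea, not code from the PR): weighted squared error of a
// block at the best scale, maintained incrementally while candidate
// quantized values change.
#include <cstddef>
#include <cstdint>

struct cumulative_wse {
    double sum_wx2 = 0; // sum of w_i * x_i^2 (fixed for the block)
    double sum_wxq = 0; // sum of w_i * x_i * q_i
    double sum_wq2 = 0; // sum of w_i * q_i^2

    void init(const float * x, const float * w, const int8_t * q, size_t n) {
        sum_wx2 = sum_wxq = sum_wq2 = 0;
        for (size_t i = 0; i < n; ++i) {
            sum_wx2 += (double)w[i]*x[i]*x[i];
            sum_wxq += (double)w[i]*x[i]*q[i];
            sum_wq2 += (double)w[i]*q[i]*q[i];
        }
    }
    // change one q_i from q_old to q_new in O(1)
    void update(float x, float w, int q_old, int q_new) {
        sum_wxq += (double)w*x*(q_new - q_old);
        sum_wq2 += (double)w*((double)q_new*q_new - (double)q_old*q_old);
    }
    // best scale d = sum(w x q)/sum(w q^2); error = sum(w (x - d q)^2) at that d
    double best_scale() const { return sum_wq2 > 0 ? sum_wxq/sum_wq2 : 0.0; }
    double error() const { return sum_wq2 > 0 ? sum_wx2 - sum_wxq*sum_wxq/sum_wq2 : sum_wx2; }
};
```

Metrics that cannot be decomposed into such running sums (PPL, KLD) are much harder to plug into this kind of search, which is part of why weighted squared error serves as the surrogate.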
On which model(s) did you observe this? I'd like to reproduce this observation.
Right, but the test corpus is not infinite, and for a small test corpus I actually find KLD faster for meaningful comparisons (because the ± error goes down faster than for PPL). But I agree PPL is more convenient for quickly comparing versions of quants of a lot of different models (because the logits files get big really fast), at least when using a GPU.
Yes, totally agree! And technically I already got what I wanted out of these algorithms (even if they are not merged or not better), which is the very nice plots they can make, to hopefully help me understand the representable vector space of both linear and non-linear quants a bit more, especially when viewed appropriately in a 360-degree panorama viewer: https://blobs.compilade.net/pannellum.htm#panorama=equirectangular-iq4nl-qkxs-2048.png
-
It is a heuristic. Trial and error. IIRC, higher-bpw quants do better with a stronger large-magnitude weighting (e.g., […]).
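As an illustration of the kind of heuristic in question (hypothetical code, not a quote of the actual llama.cpp defaults): importance weights of this sort are typically simple functions of the weight magnitude and the block's mean squared value, with the strength of the large-magnitude emphasis chosen per quantization type.

```cpp
// Illustrative only - the exact default weights differ per quantization type.
// x: the block's weights, n: block size, w: output importance weights.
#include <cmath>

static void default_importance_weights(const float * x, int n, float * w, bool strong) {
    float sigma2 = 0.0f;
    for (int i = 0; i < n; ++i) sigma2 += x[i]*x[i];
    sigma2 /= n; // mean squared value of the block
    for (int i = 0; i < n; ++i) {
        // stronger large-magnitude weighting (reported above to suit higher-bpw quants): ~x^2
        // weaker weighting: ~sqrt(sigma2 + x^2)
        w[i] = strong ? x[i]*x[i] : sqrtf(sigma2 + x[i]*x[i]);
    }
}
```

Which of these (or something else entirely) works best at a given bpw is exactly the trial-and-error referred to above.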
Go back to the basics. Start with LLaMA-v1-7B. I know, nobody uses that today. But then again, almost all of k-quants development was based on experience with the LLaMA-v1 models, and k-quants have done surprisingly well in the almost two years since they were released on the thousands of models they have been tried on. Even today, when I want to try a new quantization idea, I always check performance with LLaMA-v1, LLaMA-v2, and Mistral-7B. Your […]
Oh, I used […]
-
You may be interested in PR #295.
-
While not directly related to the quants specific to #295, I did just release what may be one of the best quants (for generation quality) in its size class for DeepSeek V3-0324. It only works with `ik_llama.cpp`. I haven't done full perplexity and benchmarking comparisons across the major quant cookers' versions, but I have a rough table showing the differences between ubergarm, @bartowski1182, @danielhanchen (unsloth), and eventually mradermacher's recipes. I'll add it in the fold here for convenience. Big thanks to y'all doing so much inspirational work and making this stuff more and more accessible! 👇

V3-0324 quant recipe comparison table
-
For Maverick you reported hitting this overprotectiveness issue in `llama.cpp`.

That issue has been addressed here in #202, but you may need to adjust it to allow 10% missing to get the `blk.1` tensors as well (block 45 is below 50%, though, which seems very odd).
-
@compilade has submitted an interesting PR in the mainline `llama.cpp` repository. As is often the case, @jukofyork has improvement ideas. As both pinged me, and as I no longer hang around in the `llama.cpp` project, I'll address the pings here.

### @compilade's PR

First of all, this is a nice piece of work, so congratulations!

I did try the PR on a few models. I focused on `Q3_K` and `IQ4_NL`, as I don't see the utility of using quantization types meant for ternary models (`TQ1_0`, `TQ2_0`) also for non-ternary models, and am also not particularly interested in the legacy quantization types (`Q4_0`, `Q5_0`: too low quality relative to the bits spent). I could have also looked at `IQ4_XS`, but it is very similar to `IQ4_NL`, so here we go with my observations:

- `Q3_K` is significantly better than the existing quantization method (but see below).
- `IQ4_NL` is hit-or-miss: sometimes slightly better, sometimes slightly worse, but overall not much of a difference apart from the 5X increase in quantization time.
- When I was adding quantization to `llama.cpp` it wasn't clear that it would take off the way it did. Hence, the quantization methods I contributed are the way they are. Perhaps they are suboptimal when there is a (meaningful) imatrix, but a major driving force was to make them as robust as possible for quantization without an imatrix.
- Be careful when judging results obtained with `--pure`: it may appear that one gets an improvement because the new method being tested happens to do better on exactly these tensors, but worse on many others. One gets excited about having improved things, but then in practice, with the high-impact tensors quantized with more bits in the quantization mix, suddenly the observed quality is lower than what one had before. Case in point: `Q3_K_M` with your PR often has a higher PPL than the existing quantization, despite being clearly better with `--pure`.
- Another caveat about `--pure`: in some models token embedding quantization has a disproportionate impact on observed quality, and some quantization types do not quantize `token_embd.weight` very well. You do use `Q8_0` for the output tensor; I think it would be better to also use `Q8_0` for the token embeddings when using `--pure`.
- `IQ4_K` and `IQ5_K` here are miles ahead of any 4- or 5-bpw quantization type in mainline `llama.cpp`. Hence, I'm skeptical that they can be improved with your PR (but you are more than welcome to submit a PR here if you are able to demonstrate an improvement).
- `IQ2_K` and `IQ3_K` are on par with or slightly better than i-quants of similar size, so before improving these you have to find a way to apply the methods of your PR to `IQ2_XXS`, `IQ2_XS`, `IQ2_S`, `IQ3_XXS`, `IQ3_S` (one of your TODO items).
- Regarding `TQ2_0` being faster than `IQ1_S`: in theory, sure. In practice, the table below shows what I observe with the PR branch for `TQ2_0`, and with `ik_llama.cpp` for `IQ1_S` (using the row-interleaved variant `IQ1_S_R4`):

### @jukofyork's ideas
If you start with a fully symmetric probability distribution (not always the case, but for simplicity let's assume it is fully symmetric), draw a finite number of random samples from it (the weights in one quantization block), and then scale the sampled values such that the maximum-magnitude value always takes the same scaled value, you end up with a non-symmetric probability distribution for the scaled samples. The smaller the sample size, the larger the asymmetry. With the sample size approaching infinity, the observed probability distribution will become symmetric. You can ask WolframAlpha about it, or you can write a simple script that samples 32 values from a Gaussian distribution, scales, and plots the resulting scaled pdf.
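One way to make this concrete (a sketch of the argument under the stated assumptions: i.i.d. samples from a symmetric distribution, with the maximum-magnitude sample pinned to a fixed value $-c$ by the scaling; the negative sign convention is an assumption, and the mirrored case works the same way):

Let $x_1,\dots,x_n$ be i.i.d. from a symmetric distribution, let $M=\arg\max_i |x_i|$, and choose the scale $s=-c/x_M$ so that the maximum-magnitude sample always maps to $-c$. For $i\ne M$, flipping the sign of $x_i$ changes neither $M$ nor $s$, so $s\,x_i$ stays symmetrically distributed around $0$; but $s\,x_M$ is a point mass at $-c$. The mixture of the two is therefore asymmetric, and the mean of the scaled samples is

$$\mathbb{E}\!\left[\frac{1}{n}\sum_{i=1}^{n} s\,x_i\right] = -\frac{c}{n},$$

which only approaches $0$ (and the asymmetry only washes out) as $n\to\infty$.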
Anyway, this is why the `IQ4_NL` (and `IQ4_XS`, as well as the `IQ2_K`, `IQ3_K` quants from this repository) quant lookup tables are asymmetric (and not because I'm a moron who didn't know how to make a symmetric function). But if you don't take this for granted (you most likely don't), just go and replace `kvalues_iq4nl` in `ggml-quants.c` with your symmetric variant, and watch the disaster that ensues. You need to do it in a few more places, because for some reason this table is not in `ggml-common.h` as it should be.

¹ I know, I know. The Internet Gods have spoken: PPL doesn't tell us anything and is completely useless; KLD is the one and only true measure of quantization quality. But me, not being a religious person, and having quite a bit of research experience under my belt, I don't take the Gods' opinions for granted. I have written elsewhere about the equivalence of PPL and KLD for an infinitely large test corpus, and about the superiority of PPL for a test corpus of limited size, so I will not repeat myself here.
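For reference, a sketch of how the two quantities relate (my paraphrase of the standard definitions, not necessarily the argument alluded to above): with $p(t\mid x)$ the full-precision model's next-token probabilities, $q(t\mid x)$ the quantized model's, and averages taken over a test corpus $\mathcal{D}$,

$$\ln \mathrm{PPL}(q) - \ln \mathrm{PPL}(p) = \mathbb{E}_{(x,t)\sim\mathcal{D}}\big[\ln p(t\mid x) - \ln q(t\mid x)\big], \qquad \overline{\mathrm{KLD}} = \mathbb{E}_{x\sim\mathcal{D}}\sum_t p(t\mid x)\,\ln\frac{p(t\mid x)}{q(t\mid x)}.$$

The two averages differ only in whether the next token comes from the test data or is drawn from the full-precision model, which is the sense in which they become interchangeable for a sufficiently large test corpus that the full-precision model predicts well; for a small corpus, each comes with its own statistical error.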