Replies: 4 comments 11 replies
-
Hi @DavidZyy, this is simply an empirical correction; there is no science behind it (and it was amusing to observe people trying to make scientific sense out of it). From the pre-imatrix days we have learned that it is better to assign higher weights (importance) to model weights with larger magnitudes in a weighted RMSE minimization. As there is no precise science behind that, it was just a matter of experimentation to determine what this higher importance should look like.

> Why the need for correcting the Hessian in the first place?
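To make the "weighted RMSE minimization" above concrete, here is a minimal sketch in the spirit of (but much simpler than) `quantize_row_q4_0_impl`. The fixed block size of 32, the single refinement pass, and the function name are my own simplifications, not the actual llama.cpp code:

```c
// Minimal sketch: quantize one block of 32 floats to 4-bit quants in [-8, 7],
// choosing the scale d that minimizes the weighted squared error
//   sum_j w[j] * (x[j] - d*q[j])^2,
// where w[j] = qw[j] * sqrtf(sigma2 + x[j]*x[j]) is the empirical correction
// discussed above (qw[j] are the imatrix importances, sigma2 a per-row scale).
#include <math.h>
#include <stdint.h>

#define QK 32

static float quantize_block_q4_weighted(const float *x, const float *qw,
                                        float sigma2, int8_t *q) {
    // Initial scale guess: map the largest-magnitude value onto -8.
    float amax = 0.0f, max = 0.0f;
    for (int j = 0; j < QK; ++j) {
        if (fabsf(x[j]) > amax) { amax = fabsf(x[j]); max = x[j]; }
    }
    if (amax == 0.0f) {
        for (int j = 0; j < QK; ++j) q[j] = 0;
        return 0.0f;
    }
    const float d0 = max / -8.0f;
    const float id = 1.0f / d0;

    // Round to the nearest quant, then refit the scale by weighted least
    // squares: with the q[j] fixed, the optimal d is
    //   d = sum_j w[j]*x[j]*q[j] / sum_j w[j]*q[j]*q[j].
    float sumqx = 0.0f, sumq2 = 0.0f;
    for (int j = 0; j < QK; ++j) {
        int qi = (int)roundf(id * x[j]);
        if (qi < -8) qi = -8;
        if (qi >  7) qi =  7;
        q[j] = (int8_t)qi;
        const float w = qw[j] * sqrtf(sigma2 + x[j]*x[j]);
        sumqx += w * x[j] * qi;
        sumq2 += w * qi * qi;
    }
    return sumq2 > 0.0f ? sumqx / sumq2 : d0;
}
```

Making `w[j]` larger for large `|x[j]|` biases the least-squares fit toward reproducing the big entries more accurately, which is exactly the "higher importance for larger magnitudes" idea.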
-
Thanks for taking the time to answer this question and share the information; I learned a lot from your answers.
-
Oh shit, I just realised I totally forgot to reply to this post! @ikawrakow Thanks for the explanation! FWIW, I actually tested a couple of different schemes that were more grounded in regularisation theory, but they performed worse than your empirical method. It would still be nice to find some way to interpolate between the two extremes; the recent 256-expert MoEs are a good case in point! I did manage to fix some of this back then; IIRC, all the main discussion is in this PR: ggml-org/llama.cpp#6387 (comment). But I still suspect that for these new very-high-expert-count MoEs the weighting should really be down-regularised compared to non-MoE models or older low-expert-count MoEs.
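One hypothetical way such an interpolation could look (purely a sketch; `alpha` and `blend_weight` are not anything that exists in llama.cpp): expose an exponent that scales the empirical magnitude correction down toward the plain imatrix weight.

```c
// Hypothetical interpolation knob between the raw imatrix weight (alpha = 0)
// and the full empirical magnitude correction (alpha = 1). Values in between
// "down-regularise" the correction, e.g. for very-high-expert-count MoEs.
#include <math.h>

static inline float blend_weight(float qw, float xb, float sigma2, float alpha) {
    const float correction = sqrtf(sigma2 + xb*xb);
    return qw * powf(correction, alpha);
}
```

Whether an `alpha` below 1 actually helps for the new 256-expert models would of course need to be measured; this is only meant to illustrate the idea of interpolating.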
-
@jukofyork So, I have used regularization in a variety of contexts. Sadly, having spent the better part of my career in Medical Device, where everything is closed source, there aren't many examples of that in the open. This repository uses Tikhonov regularization for the training of an SVM model to recognize handwritten digits. I put it out there because I find it funny that with fewer lines of code I can beat the ggml mnist example by a huge margin (0.4% vs 2% error rate, so 5X lower).

But having used regularization techniques in deformable image registration, large-scale optimization of radiation therapy treatments, real-time target and/or critical organ tracking on live MRI images, MR and PET image reconstruction, etc., I think I know quite well when regularization is required, and LLM quantization is not one of the cases where it is, at least not in the classical sense of adding penalty term(s) to the optimization objective. For instance, Tikhonov regularization, which was being proposed in one of the discussions, is pretty much the last thing we want to do when quantizing, because we definitely do not want to make the quantized values as small as possible, which is the goal of the Tikhonov regularization term. At some level, one can consider i-quants as using "regularization" via forcing groups of quants to fall on a finite set of grid points, the set being much smaller than all possible grid points for the given number of bits per quant.

The other thing I have learned is that theories are rarely useful in their pure form. More often than not, you start with this beautiful theory only to find that it does not work very well in practice. So you start adding fudge factors, and things get better. And then you add even more fudge factors and it gets better still. When you are done with it you have something that works really well, but you barely recognize the beautiful pure theory you started from.

Just my 2 cents
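To spell out the point about Tikhonov regularization: applied to block quantization it would add a penalty on the size of the reconstructed values to the weighted objective, something like the following (a sketch with $\lambda$ as the regularization strength; the exact form proposed in that discussion may have differed):

$$\min_{d,\,\{q_j\}} \; \sum_j w_j \left(x_j - d\,q_j\right)^2 \;+\; \lambda \sum_j \left(d\,q_j\right)^2$$

The penalty term is minimized by shrinking the reconstructed values $d\,q_j$ toward zero, which is precisely the bias we do not want to introduce into the quantized weights.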
-
Hi @ikawrakow, your work on quantization is amazing and I really admire it. Recently I have been reading the code around this and have some questions.

For example, in the function `quantize_row_q4_0_impl` (and other places), `weight[j]` is: `weight[j] = qw[j] * sqrtf(sigma2 + xb[j]*xb[j]);`

I have already seen some discussions here, but I still don't quite understand. Can you give me some guidance? Why not use the following directly?