Replies: 4 comments 11 replies
-
Hi @DavidZyy, this is simply an empirical correction; there is no science behind it (and it was amusing to observe people trying to make scientific sense out of it). From the pre-imatrix days we have learned that it is better to assign higher weights (importance) to model weights with larger magnitudes in a weighted RMSE minimization. As there is no precise science behind that, it was just a matter of experimentation to determine what this higher importance should look like.

> Why the need for correcting the Hessian in the first place?
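To make the "weighted RMSE minimization" above concrete, here is a minimal sketch in the spirit of (but much simpler than) `quantize_row_q4_0_impl`. The fixed block size of 32, the single refinement pass, and the function name are my own simplifications, not the actual llama.cpp code:

```c
// Minimal sketch: quantize one block of 32 floats to 4-bit quants in [-8, 7],
// choosing the scale d that minimizes the weighted squared error
//   sum_j w[j] * (x[j] - d*q[j])^2,
// where w[j] = qw[j] * sqrtf(sigma2 + x[j]*x[j]) is the empirical correction
// discussed above (qw[j] are the imatrix importances, sigma2 a per-row scale).
#include <math.h>
#include <stdint.h>

#define QK 32

static float quantize_block_q4_weighted(const float *x, const float *qw,
                                        float sigma2, int8_t *q) {
    // Initial scale guess: map the largest-magnitude value onto -8.
    float amax = 0.0f, max = 0.0f;
    for (int j = 0; j < QK; ++j) {
        if (fabsf(x[j]) > amax) { amax = fabsf(x[j]); max = x[j]; }
    }
    if (amax == 0.0f) {
        for (int j = 0; j < QK; ++j) q[j] = 0;
        return 0.0f;
    }
    const float d0 = max / -8.0f;
    const float id = 1.0f / d0;

    // Round to the nearest quant, then refit the scale by weighted least
    // squares: with the q[j] fixed, the optimal d is
    //   d = sum_j w[j]*x[j]*q[j] / sum_j w[j]*q[j]*q[j].
    float sumqx = 0.0f, sumq2 = 0.0f;
    for (int j = 0; j < QK; ++j) {
        int qi = (int)roundf(id * x[j]);
        if (qi < -8) qi = -8;
        if (qi >  7) qi =  7;
        q[j] = (int8_t)qi;
        const float w = qw[j] * sqrtf(sigma2 + x[j]*x[j]);
        sumqx += w * x[j] * qi;
        sumq2 += w * qi * qi;
    }
    return sumq2 > 0.0f ? sumqx / sumq2 : d0;
}
```

Making `w[j]` larger for large `|x[j]|` biases the least-squares fit toward reproducing the big entries more accurately, which is exactly the "higher importance for larger magnitudes" idea.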
-
Thanks for taking the time to answer this question and share the information; I learned a lot from your answers.
-
Oh shit, I just realised I totally forgot to reply to this post! @ikawrakow Thanks for the explanation! FWIW, I actually tested a couple of different schemes that were more grounded in regularisation theory, but they performed worse than your empirical method. It would still be nice to find some way to interpolate between the two extremes; the recent 256-expert MoEs are a good case in point! I did manage to fix some of this back then; IIRC, all the main discussion is in this PR: ggml-org/llama.cpp#6387 (comment). But I still suspect that for these new very-high-expert-count MoEs the weighting should really be down-regularised compared to non-MoE models or older low-expert-count MoEs.
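One hypothetical way such an interpolation could look (purely a sketch; `alpha` and `blend_weight` are not anything that exists in llama.cpp): expose an exponent that scales the empirical magnitude correction down toward the plain imatrix weight.

```c
// Hypothetical interpolation knob between the raw imatrix weight (alpha = 0)
// and the full empirical magnitude correction (alpha = 1). Values in between
// "down-regularise" the correction, e.g. for very-high-expert-count MoEs.
#include <math.h>

static inline float blend_weight(float qw, float xb, float sigma2, float alpha) {
    const float correction = sqrtf(sigma2 + xb*xb);
    return qw * powf(correction, alpha);
}
```

Whether an `alpha` below 1 actually helps for the new 256-expert models would of course need to be measured; this is only meant to illustrate the idea of interpolating.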
-
@jukofyork So, I have used regularization in a variety of contexts. Sadly, having spent the better part of my career in Medical Device, where everything is closed source, there aren't many examples of that in the open. This repository uses Tikhonov regularization for the training of an SVM model to recognize handwritten digits. I put it out there because I find it funny that with fewer lines of code I can beat the ggml mnist example by a huge margin (0.4% vs 2% error rate, so 5X lower).

But having used regularization techniques in deformable image registration, large-scale optimization of radiation therapy treatments, real-time target and/or critical organ tracking on live MRI images, MR and PET image reconstruction, etc., I think I know quite well when regularization is required, and LLM quantization is not one of the cases where it is, at least not in the classical sense of adding penalty term(s) to the optimization objective. For instance, Tikhonov regularization, which was being proposed in one of the discussions, is pretty much the last thing we want to do when quantizing, because we definitely do not want to make the quantized values as small as possible, which is the goal of the Tikhonov regularization term. At some level, one can consider i-quants as using "regularization" via forcing groups of quants to fall on a finite set of grid points, the set being much smaller than all possible grid points for the given number of bits per quant.

The other thing I have learned is that theories are rarely useful in their pure form. More often than not, you start with this beautiful theory only to find that it does not work very well in practice. So you start adding fudge factors, and things get better. And then you add even more fudge factors and it gets better still. When you are done with it you have something that works really well, but you barely recognize the beautiful pure theory you started from.

Just my 2 cents
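To spell out the point about Tikhonov regularization: applied to block quantization it would add a penalty on the size of the reconstructed values to the weighted objective, something like the following (a sketch with $\lambda$ as the regularization strength; the exact form proposed in that discussion may have differed):

$$\min_{d,\,\{q_j\}} \; \sum_j w_j \left(x_j - d\,q_j\right)^2 \;+\; \lambda \sum_j \left(d\,q_j\right)^2$$

The penalty term is minimized by shrinking the reconstructed values $d\,q_j$ toward zero, which is precisely the bias we do not want to introduce into the quantized weights.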
-
Hi @ikawrakow, your work on quantization is amazing and I really admire it. Recently I have been reading the code around this and have some questions.

For example, in the function `quantize_row_q4_0_impl` (and other places), `weight[j]` is: `weight[j] = qw[j] * sqrtf(sigma2 + xb[j]*xb[j]);`

I have already seen some discussions here, but I still don't quite understand. Can you give me some guidance? Why not use the following directly?