Replies: 8 comments 24 replies
-
Just to be clear: I wasn't implying you had done anything wrong; I was merely showing something that I had noticed and spent a couple of hours playing with last year (which I never mentioned before as it wasn't clear it was of any use, nor related to anything useful). I'm sorry if I've come across badly, as that isn't my intention - I've nothing to gain from any of this, but just find it interesting :) If you search my nick you can find similar posts by me on similar topics on the now-dead 2+2 forums (everything is on Discord now, sadly) from 25+ years ago!
-
@jukofyork Sorry if I have come across a bit harsh. But it is interesting stuff indeed, so we can all get passionate about it. Anyway, attached is a very simple C++ program that illustrates the asymmetry of the scaled distribution, together with a plot of the computed average as a function of sample size that it was used to generate (a negative sample size argument will cause the program to loop between 2 and the absolute value of the argument given).
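As a rough sketch of the kind of program being described (my reconstruction, not the original attachment; the pin-to-−1 convention and the block count are assumptions), something like the following prints the average of max-normalized Gaussian samples as a function of sample size:

```cpp
// Sketch, not the original attachment: draw blocks of `n` Gaussian values,
// rescale each block so its maximum-magnitude value maps to -1 (mimicking a
// block scale tied to the largest weight), and print the average of the
// scaled values. For a symmetric distribution this average would be 0 in the
// limit of infinite block size; for finite n it comes out visibly negative.
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <random>
#include <vector>

static double scaled_average(int n, int nblocks, std::mt19937 & rng) {
    std::normal_distribution<double> gauss(0.0, 1.0);
    std::vector<double> x(n);
    double sum = 0;
    for (int b = 0; b < nblocks; ++b) {
        int imax = 0;
        for (int i = 0; i < n; ++i) {
            x[i] = gauss(rng);
            if (std::fabs(x[i]) > std::fabs(x[imax])) imax = i;
        }
        const double scale = -1.0/x[imax]; // max-magnitude value -> -1
        for (int i = 0; i < n; ++i) sum += scale*x[i];
    }
    return sum/((double)n*nblocks);
}

int main(int argc, char ** argv) {
    int arg = argc > 1 ? std::atoi(argv[1]) : 32;
    if (arg == 0) arg = 32;
    // a negative argument loops over sample sizes 2 .. |arg|
    const int first = arg < 0 ? 2 : arg;
    const int last  = std::abs(arg);
    std::mt19937 rng(1234);
    for (int n = first; n <= last; ++n) {
        std::printf("%d %g\n", n, scaled_average(n, 100000, rng));
    }
    return 0;
}
```

With this pin-to-−1 convention the expected average works out to −1/n, i.e. the asymmetry shrinks with block size but never quite disappears for finite blocks.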
-
Thank you. Your existing work on […]

Right, I will consider reverting to the existing quantization methods when no imatrix is provided. I was hoping the more exhaustive algorithms would always be better (since they are better at minimizing the weighted squared error), but not when they optimize the wrong thing (i.e. when no imatrix is used). But I also suspect the default weights for the weighted rounding without an imatrix could be improved. Aside: is there a generally better solution for the default importance weights (without an imatrix)?

Strange, the increase in quantization time for `IQ4_NL` […]. And there are still some adjustments I did not try yet which could improve both the time (by a noticeable factor) and the perplexity (hopefully), which is to add the same "clamping protection" as my linear weighted rounding algorithms (e.g. in […]). I value your feedback, which is why I'll try to improve on this point (or exclude the changes to […]).
I do use `Q8_0` for the token embeddings when using `--pure`; this is the command I use:

```
$ ./bin/llama-quantize --imatrix <some-file.imatrix> --token-embedding-type q8_0 --output-tensor-type q8_0 --pure <source.gguf> <quant.gguf> <quant-type>
```
Yeah, I did notice that. The search algorithms I've made can be adapted to other metrics (although that can also be said of the existing algorithms for k-quants, since they also use weighted squared error), as long as they can be calculated cumulatively. I'd like to find better surrogates, and more exhaustive search algorithms which are not brute-force (yet still yield optimal-looking results) can help with that, even though for now minimizing weighted squared error on the model tensors doesn't quite match the actual thing we want to minimize (PPL and KLD), which makes your carefully tuned heuristics superior for now.
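To make concrete what "calculated cumulatively" can mean here (a sketch of the general idea, not code from the PR): for a fixed choice of quantized values q_i, the best block scale and the resulting weighted squared error depend only on a few running sums, which can be updated in O(1) when a single q_i changes during a search.

```cpp
// Sketch (general idea, not code from the PR): weighted squared error of a
// block at the best scale, maintained incrementally while candidate
// quantized values change.
#include <cstddef>
#include <cstdint>

struct cumulative_wse {
    double sum_wx2 = 0; // sum of w_i * x_i^2 (fixed for the block)
    double sum_wxq = 0; // sum of w_i * x_i * q_i
    double sum_wq2 = 0; // sum of w_i * q_i^2

    void init(const float * x, const float * w, const int8_t * q, size_t n) {
        sum_wx2 = sum_wxq = sum_wq2 = 0;
        for (size_t i = 0; i < n; ++i) {
            sum_wx2 += (double)w[i]*x[i]*x[i];
            sum_wxq += (double)w[i]*x[i]*q[i];
            sum_wq2 += (double)w[i]*q[i]*q[i];
        }
    }
    // change one q_i from q_old to q_new in O(1)
    void update(float x, float w, int q_old, int q_new) {
        sum_wxq += (double)w*x*(q_new - q_old);
        sum_wq2 += (double)w*((double)q_new*q_new - (double)q_old*q_old);
    }
    // best scale d = sum(w x q)/sum(w q^2); error = sum(w (x - d q)^2) at that d
    double best_scale() const { return sum_wq2 > 0 ? sum_wxq/sum_wq2 : 0.0; }
    double error() const { return sum_wq2 > 0 ? sum_wx2 - sum_wxq*sum_wxq/sum_wq2 : sum_wx2; }
};
```

Metrics that cannot be decomposed into such running sums (PPL, KLD) are much harder to plug into this kind of search, which is part of why weighted squared error serves as the surrogate.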
On which model(s) did you observe this? I'd like to reproduce this observation.
Right, but the test corpus is not infinite, and for a small test corpus I actually find KLD faster for meaningful comparisons (because the ± error goes down faster than for PPL). But I agree PPL is more convenient for quickly comparing versions of quants of a lot of different models (because the logits files get big really fast), at least when using a GPU.
Yes, totally agree! And technically I already got what I wanted out of these algorithms (even if they are not merged or not better), which is the very nice plots they can make, to hopefully help me understand the representable vector space of both linear and non-linear quants a bit more, especially when viewed appropriately in a 360-degree panorama viewer: https://blobs.compilade.net/pannellum.htm#panorama=equirectangular-iq4nl-qkxs-2048.png
-
It is a heuristic. Trial and error. IIRC, higher-bpw quants do better with a stronger large-magnitude weighting (e.g., […]).
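As an illustration of the kind of heuristic in question (hypothetical code, not a quote of the actual llama.cpp defaults): importance weights of this sort are typically simple functions of the weight magnitude and the block's mean squared value, with the strength of the large-magnitude emphasis chosen per quantization type.

```cpp
// Illustrative only - the exact default weights differ per quantization type.
// x: the block's weights, n: block size, w: output importance weights.
#include <cmath>

static void default_importance_weights(const float * x, int n, float * w, bool strong) {
    float sigma2 = 0.0f;
    for (int i = 0; i < n; ++i) sigma2 += x[i]*x[i];
    sigma2 /= n; // mean squared value of the block
    for (int i = 0; i < n; ++i) {
        // stronger large-magnitude weighting (reported above to suit higher-bpw quants): ~x^2
        // weaker weighting: ~sqrt(sigma2 + x^2)
        w[i] = strong ? x[i]*x[i] : sqrtf(sigma2 + x[i]*x[i]);
    }
}
```

Which of these (or something else entirely) works best at a given bpw is exactly the trial-and-error referred to above.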
Go back to the basics. Start with LLaMA-v1-7B. I know, nobody uses that today. But then again, almost all of k-quants development was based on experience with the LLaMA-v1 models, and k-quants have done surprisingly well in the almost two years since they were released on the thousands of models they have been tried on. Even today, when I want to try a new quantization idea, I always check performance with LLaMA-v1, LLaMA-v2, and Mistral-7B. Your […]
Oh, I used […]
-
You may be interested in PR #295.
-
While not directly related to the quants specific to #295, I did just release what may be one of the best quants (for generation quality) in its size class for DeepSeek V3-0324. It only works with `ik_llama.cpp`. I haven't done full perplexity and benchmarking comparisons across the major quant cookers' versions, but I have a rough table showing the differences between ubergarm, @bartowski1182, @danielhanchen (unsloth), and eventually mradermacher's recipes. I'll add it in the fold here for convenience. Big thanks to y'all doing so much inspirational work and making this stuff more and more accessible! 👇

V3-0324 quant recipe comparison table
-
For Maverick you reported hitting this overprotectiveness issue in `llama.cpp`.

That issue has been addressed here in #202, but you may need to adjust it to allow 10% missing to get the `blk.1` tensors as well (block 45 is below 50%, though, which seems very odd).
-
@compilade has submitted an interesting PR in the mainline `llama.cpp` repository. As is often the case, @jukofyork has improvement ideas. As both pinged me, and as I no longer hang around in the `llama.cpp` project, I'll address the pings here.

### @compilade's PR

First of all, this is a nice piece of work, so congratulations!

I did try the PR on a few models. I focused on `Q3_K` and `IQ4_NL`, as I don't see the utility of using quantization types meant for ternary models (`TQ1_0`, `TQ2_0`) also for non-ternary models, and am also not particularly interested in the legacy quantization types (`Q4_0`, `Q5_0`: too low quality relative to the bits spent). I could have also looked at `IQ4_XS`, but it is very similar to `IQ4_NL`, so here we go with my observations:

- `Q3_K` is significantly better than the existing quantization method (but see below).
- `IQ4_NL` is hit-or-miss: sometimes slightly better, sometimes slightly worse, but overall not much of a difference apart from the 5X increase in quantization time.
- When I was adding quantization to `llama.cpp` it wasn't clear that it would take off the way it did. Hence, the quantization methods I contributed are the way they are. Perhaps they are suboptimal when there is a (meaningful) imatrix, but a major driving force was to make them as robust as possible for quantization without an imatrix.
- Be careful when judging results obtained with `--pure`: it may appear that one gets an improvement because the new method being tested happens to do better on exactly these tensors, but worse on many others. One gets excited about having improved things, but then in practice, with the high-impact tensors quantized with more bits in the quantization mix, suddenly the observed quality is lower than what one had before. Case in point: `Q3_K_M` with your PR often has a higher PPL than the existing quantization, despite being clearly better with `--pure`.
- Another caveat about `--pure`: in some models token embedding quantization has a disproportionate impact on observed quality, and some quantization types do not quantize `token_embd.weight` very well. You do use `Q8_0` for the output tensor; I think it would be better to also use `Q8_0` for the token embeddings when using `--pure`.
- `IQ4_K` and `IQ5_K` here are miles ahead of any 4- or 5-bpw quantization type in mainline `llama.cpp`. Hence, I'm skeptical that they can be improved with your PR (but you are more than welcome to submit a PR here if you are able to demonstrate an improvement).
- `IQ2_K` and `IQ3_K` are on par with or slightly better than i-quants of similar size, so before improving these you have to find a way to apply the methods of your PR to `IQ2_XXS`, `IQ2_XS`, `IQ2_S`, `IQ3_XXS`, `IQ3_S` (one of your TODO items).
- Regarding `TQ2_0` being faster than `IQ1_S`: in theory, sure. In practice, the table below shows what I observe with the PR branch for `TQ2_0`, and with `ik_llama.cpp` for `IQ1_S` (using the row-interleaved variant `IQ1_S_R4`):

### @jukofyork's ideas
If you start with a fully symmetric probability distribution (not always the case, but for simplicity let's assume it is fully symmetric), draw a finite number of random samples from it (the weights in one quantization block), and then scale the sampled values such that the maximum-magnitude value always takes the same scaled value, you end up with a non-symmetric probability distribution for the scaled samples. The smaller the sample size, the larger the asymmetry. With the sample size approaching infinity, the observed probability distribution will become symmetric. You can ask WolframAlpha about it, or you can write a simple script that samples 32 values from a Gaussian distribution, scales, and plots the resulting scaled pdf.
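One way to make this concrete (a sketch of the argument under the stated assumptions: i.i.d. samples from a symmetric distribution, with the maximum-magnitude sample pinned to a fixed value $-c$ by the scaling; the negative sign convention is an assumption, and the mirrored case works the same way):

Let $x_1,\dots,x_n$ be i.i.d. from a symmetric distribution, let $M=\arg\max_i |x_i|$, and choose the scale $s=-c/x_M$ so that the maximum-magnitude sample always maps to $-c$. For $i\ne M$, flipping the sign of $x_i$ changes neither $M$ nor $s$, so $s\,x_i$ stays symmetrically distributed around $0$; but $s\,x_M$ is a point mass at $-c$. The mixture of the two is therefore asymmetric, and the mean of the scaled samples is

$$\mathbb{E}\!\left[\frac{1}{n}\sum_{i=1}^{n} s\,x_i\right] = -\frac{c}{n},$$

which only approaches $0$ (and the asymmetry only washes out) as $n\to\infty$.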
Anyway, this is why the `IQ4_NL` (and `IQ4_XS`, as well as the `IQ2_K`, `IQ3_K` quants from this repository) quant lookup tables are asymmetric (and not because I'm a moron who didn't know how to make a symmetric function). But if you don't take this for granted (you most likely don't), just go and replace `kvalues_iq4nl` in `ggml-quants.c` with your symmetric variant, and watch the disaster that ensues. You need to do it in a few more places, because for some reason this table is not in `ggml-common.h` as it should be.

¹ I know, I know. The Internet Gods have spoken: PPL doesn't tell us anything and is completely useless; KLD is the one and only true measure of quantization quality. But me, not being a religious person, and having quite a bit of research experience under my belt, I don't take the Gods' opinions for granted. I have written elsewhere about the equivalence of PPL and KLD for an infinitely large test corpus, and about the superiority of PPL for a test corpus of limited size, so I will not repeat myself here.
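For reference, a sketch of how the two quantities relate (my paraphrase of the standard definitions, not necessarily the argument alluded to above): with $p(t\mid x)$ the full-precision model's next-token probabilities, $q(t\mid x)$ the quantized model's, and averages taken over a test corpus $\mathcal{D}$,

$$\ln \mathrm{PPL}(q) - \ln \mathrm{PPL}(p) = \mathbb{E}_{(x,t)\sim\mathcal{D}}\big[\ln p(t\mid x) - \ln q(t\mid x)\big], \qquad \overline{\mathrm{KLD}} = \mathbb{E}_{x\sim\mathcal{D}}\sum_t p(t\mid x)\,\ln\frac{p(t\mid x)}{q(t\mid x)}.$$

The two averages differ only in whether the next token comes from the test data or is drawn from the full-precision model, which is the sense in which they become interchangeable for a sufficiently large test corpus that the full-precision model predicts well; for a small corpus, each comes with its own statistical error.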