-
Has there been any QAT going on with the Qwen3 models? I didn't see anything mentioned in the linked blog post, but there are indications that QAT may have been involved. Does somebody know?
-
**QAT used in Qwen3 training?**

After posting the above results, I decided to dig a little deeper into this. OK, then, let's just redo all recipes and look at the PPL numbers again.
Oops. Recipes 5 and 6 have a lower PPL than the bf16 model. Hmm. OK, it must be my imatrix. People have filled whole libraries writing about how the imatrix calibration data needs to be random, diverse, whatnot. OK, let's grab the Unsloth imatrix. Quantize, run PPL. Oops. That's definitely even less diverse than mine. Let's grab the Bartowski imatrix. Quantize recipe IQK-6, run PPL. Oops. It looks just like mine. What happens if I don't use an imatrix at all? Quantize recipe IQK-6 without an imatrix, run PPL. So, that's about 2.6% quantization error, so in the range of what I would expect without an imatrix.
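For reference, each of these rounds is just a re-quantize plus a PPL run. A sketch with placeholder file names, using recipe IQK-6 and an external imatrix as the example:

```bash
# Sketch only: re-quantize with a given imatrix (the recipe's --custom-q rules
# are omitted here), then measure perplexity. File names are placeholders.
./bin/llama-quantize --imatrix some-external.imatrix \
    Qwen3-30B-A3B-BF16.gguf Qwen3-30B-A3B-IQK-6.gguf IQ4_KS

./bin/llama-perplexity -m Qwen3-30B-A3B-IQK-6.gguf -f wiki.test.raw
```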
So, what if they have used some form of QAT targeted towards some 4-bit format? Looking at this graph, it seems plausible that, if that is the case, quantizations around 4 bpw can land at or even slightly below the bf16 PPL. Just in case, I also double-checked the PPL runs.
-
Oh man, I just released ubergarm/Qwen3-30B-A3B-GGUF/Qwen3-30B-A3B-mix-IQ4_K.gguf just before finding and reading this discussion!!! ooops! I have some PPL data from it as well. EDIT: Added an unreleased mix too.
I have some more KLD and token probability stats too, with graphs, to make a better write-up eventually. So it sounds like, if Qwen was using QAT targeted at fp4, then it may be possible to use IQ4_KS to shave some weight without sacrificing quality? I'll have to try some more mixes... If I'm following here, it sounds like the goal is to get PPL as low as possible without going below the bf16 PPL?
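For anyone wanting to reproduce this kind of comparison: llama-perplexity's KL-divergence mode can produce KLD and top-token probability stats. A sketch with placeholder file names (not necessarily the exact commands used here):

```bash
# 1) Save baseline logits from the bf16 model (placeholder file names).
./bin/llama-perplexity -m Qwen3-30B-A3B-BF16.gguf -f wiki.test.raw \
    --kl-divergence-base qwen3-bf16-logits.dat

# 2) Compare a quantized model against the saved baseline; this reports
#    mean KLD and token-probability statistics in addition to PPL.
./bin/llama-perplexity -m Qwen3-30B-A3B-mix-IQ4_K.gguf -f wiki.test.raw \
    --kl-divergence-base qwen3-bf16-logits.dat --kl-divergence
```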
-
I have posted 3
-
@ubergarm Great write-up! The fact that the ikawrakow/IQ4_KS_Unsloth model gets a lower PPL than the bf16 model fits the QAT hypothesis. My only comment: when there is no doubt that the quantized model's PPL is genuinely below the bf16 PPL, a lower PPL no longer means a better quantization.
In that scenario, the larger the difference between the quantized and the bf16 PPL, the larger the deviation from the original model.
-
@ikawrakow i am having a hard time understanding where the iqx_k quants came from? is there an explanation somewhere other than the code?
-
I did some experimentation with Qwen3 quantization. As I don't have the horsepower to run the flagship model, I experimented with the Qwen3-30B-A3B MoE model. I'm reporting the results here in the hope that they will also be useful for Qwen3-235B-A22B.
The following graph shows a comparison between the Unsloth so-called "dynamic" quants and the quantization mixes I prepared. The Unsloth quantized models, shown with black symbols, are from their HF repository, and the black text beside the data points gives the corresponding file name. The red symbols are for my quantization mixes; their recipes are given below. The x-axis is model size in GiB (and not GB, as HF likes to use). The y-axis is the quantization error in percent, defined as `PPL(Q)/PPL(bf16) - 1`. Based on these results, it does not look like Unsloth did a particularly good job with their "dynamic" quants for this model. One can get the same quantization quality with a ~2 GiB smaller model, so nearly 20% smaller at the low-bpw end.
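With made-up numbers, just to illustrate the metric: if PPL(bf16) were 9.00 and a quantized model measured PPL(Q) = 9.23, the quantization error would be 9.23/9.00 - 1 ≈ 0.026, i.e. about 2.6%.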
My recipes are almost entirely composed of `IQK` quants, so exclusive to this repository. I did not go beyond 4.3 bpw, as there the quantization error is 0.57% (and I have seen sub-1% quantization error being called "lossless" in the quantization literature).

**Recipe IQK-1**
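Something along these lines (a sketch: file names are placeholders and the exact regex/type spellings may need adjusting):

```bash
# Sketch of the IQK-1 mix with ik_llama.cpp's llama-quantize.
# Placeholder file names; tensors not matched by a --custom-q rule are meant
# to fall back to the default type given as the last argument (IQ2_KS here).
./bin/llama-quantize --imatrix qwen3-30b-a3b.imatrix \
    --custom-q "token_embd\.weight=q4_K" \
    --custom-q "output\.weight=q6_K" \
    --custom-q "attn=iq5_k" \
    --custom-q "blk\.[0-5]\.ffn_down_exps=iq4_ks" \
    Qwen3-30B-A3B-BF16.gguf Qwen3-30B-A3B-IQK-1.gguf IQ2_KS
```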
Note that one can combine all arguments following `--custom-q` into a single, comma-separated list of regular expressions. I have split them into several `--custom-q` arguments for better readability. So, basically, all attention tensors are quantized with `IQ5_K`, the first 6 layers of `ffn_down_exps` with `IQ4_KS`, and everything else with `IQ2_KS`. Oh, here and for all other recipes, token embeddings are `Q4_K` and the output tensor is `Q6_K`. This quantized model ends up being 8.745 GiB, so only very slightly larger than Unsloth's `UD-IQ1_S` (8.396 GiB).
**Recipe IQK-2**
Very similar to Recipe IQK-1, with all attention tensors quantized with `IQ5_K`, the first 6 layers of `ffn_down_exps` with `IQ4_KS`, and all other experts with `IQ2_K`. The quantized model ends up being 9.314 GiB.
**Recipe IQK-3**
The difference to Recipe IQK-2 is that the first 6 layers of `ffn_down_exps` are quantized with `IQ4_K`, and the remaining `ffn_down_exps` tensors with `IQ3_K`. The quantized model size is 10.389 GiB.
**Recipe IQK-4**
Similar to Recipe IQK-3, but now the first 16 and the last 8 layers of the `ffn_up_exps` and `ffn_gate_exps` tensors are quantized with `IQ3_K`. The quantized model size is 11.584 GiB.
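For the record, the "first 16 / last 8 layers" part can be expressed with layer-range regexes. A sketch of the two additional `--custom-q` rules, assuming the model's 48 layers (`blk.0` through `blk.47`):

```bash
# Sketch: extra --custom-q rules for the first 16 (blk.0-15) and last 8 (blk.40-47)
# layers of ffn_up_exps/ffn_gate_exps, added on top of the IQK-3 rules.
--custom-q "blk\.([0-9]|1[0-5])\.ffn_(up|gate)_exps=iq3_k"
--custom-q "blk\.4[0-7]\.ffn_(up|gate)_exps=iq3_k"
```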
**Recipe IQK-5**
I.e., all experts are `IQ3_K`, except for the first 6 layers of `ffn_down_exps`, which are `IQ4_K`. Model size is 12.779 GiB.

**Recipe IQK-6**
I.e., all tensors (except attention, output and embeddings) are `IQ4_KS`. The quantized model size is 15.454 GiB.
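A sketch of what the corresponding command might look like (placeholder file names; the attention override below simply mirrors the earlier recipes and is an assumption for IQK-6):

```bash
# Sketch only: IQ4_KS as the default type for everything not matched below.
# The attention rule assumes the same IQ5_K attention as recipes IQK-1/IQK-2.
./bin/llama-quantize --imatrix qwen3-30b-a3b.imatrix \
    --custom-q "token_embd\.weight=q4_K" \
    --custom-q "output\.weight=q6_K" \
    --custom-q "attn=iq5_k" \
    Qwen3-30B-A3B-BF16.gguf Qwen3-30B-A3B-IQK-6.gguf IQ4_KS
```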