It's pretty difficult to say why 6-bit is better than bf16. It might just be luck. If you try a few different PRNG seeds, do the results change much? That would be a good first step to see if it's just variance in the results. It's pretty unlikely, but not impossible, that there is a bug with bf16 that makes it worse than 6-bit. You could also check 5-bit and 8-bit to get a couple more data points.
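The seed-variance check above can be sketched roughly like this. `run_benchmark` is a hypothetical placeholder for whatever eval harness is being used; the fake scores exist only so the sketch runs standalone:

```python
# Sketch: estimate run-to-run variance across PRNG seeds before attributing
# a score gap to quantization. `run_benchmark` is a hypothetical stand-in
# for the real eval; swap in your actual harness.
import random
import statistics

def run_benchmark(seed: int) -> float:
    # Placeholder: fake accuracy scores around 62% with ~1% noise.
    rng = random.Random(seed)
    return 0.62 + rng.gauss(0, 0.01)

seeds = [1001, 1002, 1003, 1004, 1005]
scores = [run_benchmark(s) for s in seeds]
mean = statistics.mean(scores)
stdev = statistics.stdev(scores)
print(f"mean={mean:.3f} stdev={stdev:.3f}")
# Rule of thumb: if the bf16-vs-6-bit gap is within ~2 stdev of seed noise,
# it may not be a real effect.
```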
Does luck apply that much to greedy decoding (temp=0)? Today I converted bf16 to fp16, and fp16 performs better. Over the next few days I will do more tests, but I have already done plenty with bf16 and the 8-bit, 6-bit, and 4-bit DWQ models, and I will do more with fp16 too. For each run I use seeds like 1001, 1002, 1003. Is that random enough?
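On the greedy-decoding point: with temperature 0 the token choice is an argmax, so the seed never enters the picture, while any temperature above 0 makes the seed matter. A minimal sketch (toy logits, stdlib only, nothing model-specific assumed):

```python
# Sketch: greedy (temp=0) token selection ignores the seed entirely;
# temperature sampling does not. Toy 4-token logit vector for illustration.
import math
import random

def pick_token(logits, temperature, seed):
    if temperature == 0:
        # Greedy: pure argmax, fully deterministic, seed unused.
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = random.Random(seed)
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

logits = [1.2, 3.4, 0.5, 2.9]
greedy = {pick_token(logits, 0, s) for s in (1001, 1002, 1003)}
sampled = {pick_token(logits, 1.5, s) for s in (1001, 1002, 1003)}
print(greedy)   # a single token id: greedy never varies with the seed
print(sampled)  # may contain more than one token id
```

So for strict temp=0 runs, seed choice should not change the output at all; seeds only matter once sampling is opened up.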
I've been running a large set of benchmarks on Qwen3-30B-A3B (hybrid, no think), and I keep seeing a surprising pattern:
The 6-bit model outperforms the BF16 model in actual task accuracy.
With strict greedy decoding (temp=0, rep_pen=1), results are the same.
But as soon as I switch to more open sampling settings, the 6-bit version consistently does better.
This is counter-intuitive — BF16 should be the “clean”, full-precision reference.
So now I'm wondering:
Why is 6-bit beating BF16?
Is there some issue with the BF16 weights, or with the qwen3_moe implementation?
Also, I think I saw someone on X highlight the same issue for a different model.
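One way to check whether the 6-bit-vs-BF16 gap described above is real rather than sampling noise is a paired sign test on per-question correctness. The pass/fail lists below are hypothetical; the idea is to feed in the real per-item results from both runs:

```python
# Sketch: paired sign test on per-item correctness for two model variants.
# Only the discordant items (exactly one model right) carry information;
# under the null hypothesis each model wins a discordant item with p=0.5.
from math import comb

def sign_test_p(bf16_correct, q6_correct):
    wins_q6 = sum(1 for a, b in zip(bf16_correct, q6_correct) if b and not a)
    wins_bf16 = sum(1 for a, b in zip(bf16_correct, q6_correct) if a and not b)
    n = wins_q6 + wins_bf16
    if n == 0:
        return 1.0  # no disagreements at all
    k = max(wins_q6, wins_bf16)
    # Two-sided binomial tail probability under p=0.5.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical per-question results (1 = correct, 0 = wrong):
bf16 = [1, 0, 1, 1, 0, 0, 1, 0]
q6   = [1, 1, 1, 1, 1, 0, 1, 1]
print(sign_test_p(bf16, q6))  # 0.25 here: far from significant
```

With only a handful of discordant questions the p-value stays large, which is the seed-variance point from the first reply in statistical form: a consistent gap across many items is needed before blaming the BF16 path or the qwen3_moe implementation.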