-
Here, and as a bonus, Qwen-235B.
PPL would work with these GGUFs.
-
Up/gate experts with Q2_K, down experts with Q4_K, shared experts with Q4_K, attention with a mix of Q4_K and Q5_0. Expect it to be far from the Pareto frontier.
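For a rough sense of where a mix like that lands on the size axis, here is a back-of-the-envelope bits-per-weight estimate. The per-type bpw values are the nominal ggml block sizes; the parameter fractions are made-up placeholders, not the actual tensor split of Qwen-235B.

```python
# Back-of-the-envelope average bits-per-weight (bpw) for the mix described above.
# Nominal ggml block sizes: Q2_K = 2.625 bpw, Q4_K = 4.5 bpw, Q5_0 = 5.5 bpw.
BPW = {"q2_k": 2.625, "q4_k": 4.5, "q5_0": 5.5}

# (fraction of total parameters, quant type) -- hypothetical split, for illustration only
mix = [
    (0.55, "q2_k"),  # up/gate experts
    (0.28, "q4_k"),  # down experts
    (0.05, "q4_k"),  # shared experts
    (0.06, "q4_k"),  # attention (part)
    (0.06, "q5_0"),  # attention (part)
]

avg_bpw = sum(frac * BPW[qtype] for frac, qtype in mix)
print(f"estimated average bpw: {avg_bpw:.2f}")  # ~3.5 bpw with these made-up fractions
```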
-
Nothing out of the ordinary so far... I'll be computing the PPL next.
-
Unable to calculate the PPL. Not sure what's going on here with these "nan" values...
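For context on how a "nan" can show up in a PPL run: perplexity is just the exponential of the average negative log-likelihood over the evaluated tokens, so a single non-finite log-prob (e.g. from a NaN logit) poisons the whole number. A minimal sketch of the arithmetic (not llama.cpp's actual implementation):

```python
import math

# PPL = exp( -(1/N) * sum_i log p(token_i | context) )
def perplexity(token_logprobs):
    nll = -sum(token_logprobs) / len(token_logprobs)  # average negative log-likelihood (nats)
    return math.exp(nll)

print(perplexity([-1.2, -0.4, -2.3, -0.9]))           # ~3.32
print(perplexity([-1.2, float("nan"), -2.3, -0.9]))   # nan -- one bad value poisons the average
```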
-
Ah, I have a few experiences here. I've got a bit of a funny system: 64 GB of RAM, a 3090, and a 2060 Super. A while ago I realized that while we all know LLMs hallucinate, for "everyday assistant use" they might be much less reliable than people sometimes expect. I'm talking Granite 2B getting details about the Mongolian decimal system correct while messing up details about Napoleon, DeepSeek V3 and R1 as well as Kimi K2 messing up major life events of a major Japanese celebrity, and even messing up Napoleon's family. On a request for "proper authentic Brazilian food for my road trip", one of these super large models claimed a certain dish was authentic Brazilian when it's Argentinian. Llama 3.3 70B didn't have this problem. So I wanted to check this out with a proper local model before returning to DeepSeek and the more reliable proprietary models like GPT and Claude.
...To the point: I tried Intel's AutoRound IQ2_K and ubergarm's IQ2_KL quants of the latest Qwen 235B. They both seemed comparable. Intel's quant gave shorter responses but mostly preserved the factual quality of the fp16 original. For targets where the fp16 variant makes errors, the quantized variants end up making a few more errors, comparable to the errors the 32B models tend to make. Intel's quant appears, at least in the few preliminary tests I did, to preserve the model's ability to recall. The IQ2_KL quant was very similar, but made slightly different errors and sounded slightly different overall. I had one instance of this quant spitting out a completely wrong token. Intel's quant appears to maybe be more predictable with the smart expert reduction? In either case, in my limited testing they appeared to be quite similar, at least with Qwen 235B.
-
I tested their recent quant, Qwen3-235B-A22B-Thinking-2507-128x10B-Q2_K_S-00001-of-00002.gguf. TBH it went very well in my tests in llama.cpp, but it didn't work well in ik_llama. Still testing though; will share if I find anything better.
-
Thank you for your interest in our work. The current AutoRound algorithm for GGUF is derived from the approach used in GGUF, but we've developed a better algorithm that will be released in the near future. As noted in the model card, our implementation uses a different mixed-bit allocation, which we believe contributes significantly to the improved performance.
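For anyone who wants to try reproducing these quants, the basic flow looks roughly like the sketch below. This is pieced together from the intel/auto-round README, so the constructor arguments and the "gguf:..." format string should be treated as assumptions and checked against the docs for whatever version you install.

```python
# Rough sketch: quantize a model with AutoRound and export to GGUF.
# NOTE: argument names and the "gguf:q4_k_m" format string are assumptions based on
# the project README; bit-width/group-size handling for GGUF targets may differ.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen3-0.6B"  # small placeholder model, not the 235B discussed here
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(model, tokenizer, bits=4, group_size=128, sym=True)
autoround.quantize_and_save("./qwen3-autoround-gguf", format="gguf:q4_k_m")
```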
-
Has anyone looked into https://github.com/intel/auto-round? I just saw it show up on my feed. Looks like they've been cooking for a little while. Has anyone tested their quants? I see they've recently added recipes for DeepSeek-R1-0528 - not too sure where to find those and how to evaluate them though - https://github.com/intel/auto-round/releases/tag/v0.6.0
Cc @ubergarm
For ref: https://x.com/haihaoshen/status/1948610166573990236 - "Intel AutoRound v0.6 released, featuring blocking scale quantization and model export to mainstream formats including GGUF, AWQ, GPTQ etc."