-
The config.json is the same (same architecture, same config), so ik_llama.cpp will behave the same, apart from the updated weights, which of course affect the output. This is just another finetune. There are cases where a finetune does change the config (see Qwen, where the base supports 128K context but the instruct tunes ship with only 32K and the recommendation: "To handle extensive inputs exceeding 32,768 tokens, we utilize YaRN, a technique for enhancing model length extrapolation, ensuring optimal performance on lengthy texts."), but this is not one of those cases, and even in that case the finetune did not change the architecture (which is what matters for conversion), just the config.
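For reference, in the Qwen case the YaRN switch is purely a config.json edit, not an architecture change. The rope_scaling block their docs describe looks roughly like the sketch below (values recalled from the Qwen model card, not from this DeepSeek release, so double-check before relying on them):

```json
{
  "rope_scaling": {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768
  }
}
```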
For the first point, the linked imatrix will work, but I do not recommend it: even though that imatrix was generated on the same model type and therefore applies, the model weights are different and that affects the imatrix data. (Edit: the mradermacher team is already working on quanting and imatrixing that model.) For the second point, those weights were present in the other releases such as V3, V3-Base, and R1, and the conversion simply does not include them, since neither llama.cpp nor ik_llama.cpp supports MTP. It is a similar situation to what happened with the MLA tensors: once support was added, the conversion script was updated to include them, which required reconverting.
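If someone does want an imatrix computed on the new weights, the usual flow with llama.cpp's imatrix tool looks roughly like this (paths and calibration file are placeholders, and ik_llama.cpp's binary name or flags may differ slightly from mainline):

```bash
# Sketch only: compute an importance matrix on the new weights.
#   -m : a high-precision GGUF of DeepSeek-V3-0324 (hypothetical path)
#   -f : calibration text to run through the model
#   -o : output file for the resulting imatrix
./build/bin/llama-imatrix \
  -m /models/DeepSeek-V3-0324-BF16.gguf \
  -f calibration.txt \
  -o deepseek-v3-0324.imatrix
```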
I'm curious, and will have to make room for it on my server. I know this is slightly off topic, but I'd be curious to hear your experience with this (and any of the other DeepSeek models you've tried).
-
**Important:** To calculate the imatrix, please do not use any of the […]

As @saood06 pointed out, this has been superseded by #259. The additional 2 tensors needed for MLA […]
-
Just saw this: "In our web and application environments, the temperature parameter […]"
-
Is this something you have looked into? I think even a basic implementation should offer a 50% improvement. There is also jukofyork, who is making draft models (see here) that can be used with llama.cpp's existing generic drafting implementation; I'm watching that to see how much performance uplift people end up reporting on it.
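For context, mainline llama.cpp's generic drafting can be tried with a small draft model roughly as sketched below (model paths are placeholders, and the draft-count flag has been renamed across versions, so treat this as an outline rather than exact syntax for this fork):

```bash
# Sketch only: speculative decoding with a separate draft model.
#   -m  : the large target model (hypothetical path)
#   -md : the small draft model that proposes tokens for the target to verify
./build/bin/llama-speculative \
  -m /models/DeepSeek-V3-0324-Q4_K_M.gguf \
  -md /models/draft-model-Q8_0.gguf \
  -p "Write a quicksort function in Python." \
  --draft 16
```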
-
The imatrix computation that gave these final perplexity values is useless. It means mainline is not working with […]
-
I saw a new model today, deepseek-ai/DeepSeek-V3-0324, that may run on this fork?
Zero pressure for anyone to spend time on this, just experimenting to satisfy my curiosity.
I figure I might as well download it and see if it magically "just works" using my existing R1 custom quant procedure.
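(For anyone unfamiliar, a generic convert-then-quantize flow for a procedure like this looks roughly as follows; paths, output type, and quant type are placeholders, and the custom per-tensor recipe is not shown:)

```bash
# Sketch only: HF -> GGUF conversion, then quantization with an imatrix.
python convert_hf_to_gguf.py /models/DeepSeek-V3-0324 \
  --outfile /models/DeepSeek-V3-0324-BF16.gguf \
  --outtype bf16

# Quantize using a previously computed imatrix (hypothetical filenames).
./build/bin/llama-quantize \
  --imatrix deepseek-v3-0324.imatrix \
  /models/DeepSeek-V3-0324-BF16.gguf \
  /models/DeepSeek-V3-0324-Q4_K_M.gguf \
  Q4_K_M
```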
The main two issues I imagine might crop up without knowing anything:

1. Whether an existing imatrix (e.g. one computed for R1 or the original V3) can be reused on this finetune.
2. The extra weights included in this release, which the conversion script may not know what to do with.
Well, I'll update this discussion after it finishes downloading and I give it the old college try haha...
Curious if anyone else has any luck and if this new model is "better" at coding like some are speculating over on r/LocalLlama... Who knows!