@@ -1,26 +1,27 @@
### 🗣️ [#100](https://github.com/ikawrakow/ik_llama.cpp/discussions/100) - New argument / env variable for GGML_SCHED_MAX_COPIES?
## 🗣️ [Discussion #100](https://github.com/ikawrakow/ik_llama.cpp/discussions/100) - New argument / env variable for GGML_SCHED_MAX_COPIES?

| **Author** | `Nexesenex` |
| :--- | :--- |
| **State** | ✅ **Open** |
| **Created** | 2024-10-21 |
| **Updated** | 2024-10-21 |

---

#### Description
## 📄 Description

@ikawrakow, could you add a CLI argument (or at least an env variable, which is much simpler I guess, but I'm failing to do it right) to set GGML_SCHED_MAX_COPIES without recompiling? It impacts VRAM occupation and performance, and it would be great to be able to set it conveniently for benchmarking and customized use.

---

#### 🗣️ Discussion
## 💬 Discussion

👤 **ikawrakow** replied the **2024-10-21** at **08:29:25**:<br>
👤 **ikawrakow** commented on **2024-10-21** at **08:29:25**

I haven't looked into this at all. What is it good for?

---

👤 **Nexesenex** replied the **2024-10-21** at **09:36:22**:<br>
👤 **Nexesenex** commented on **2024-10-21** at **09:36:22**

It's supposed to make inference faster on multi-GPU setups, I guess. Mainline sets it to 4; I set it to 1 because I didn't notice much improvement back in the day, but I did notice more VRAM consumption and GPU load.
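
For context, `GGML_SCHED_MAX_COPIES` is a compile-time constant that controls how many pipeline-parallel copies the ggml scheduler allocates, which is why changing it currently requires a rebuild. Below is a minimal sketch of the kind of runtime override being asked for, assuming a hypothetical helper (the function name, the environment-variable lookup, and the clamping range are mine, not existing code):

```cpp
// Hypothetical sketch: fall back to the compile-time default, but let an
// environment variable override it at scheduler-creation time.
#include <algorithm>
#include <cstdio>
#include <cstdlib>

#ifndef GGML_SCHED_MAX_COPIES
#define GGML_SCHED_MAX_COPIES 4   // mainline default; Nexesenex builds with 1
#endif

static int sched_max_copies_from_env() {
    int n = GGML_SCHED_MAX_COPIES;
    if (const char * env = std::getenv("GGML_SCHED_MAX_COPIES")) {
        n = std::atoi(env);       // accept the user's value if one is set
    }
    return std::clamp(n, 1, 8);   // keep it in a sane range
}

int main() {
    std::printf("scheduler pipeline copies: %d\n", sched_max_copies_from_env());
}
```
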
@@ -1,13 +1,14 @@
### 🗣️ [#104](https://github.com/ikawrakow/ik_llama.cpp/discussions/104) - Convenience improvements for llama-quantize
## 🗣️ [Discussion #104](https://github.com/ikawrakow/ik_llama.cpp/discussions/104) - Convenience improvements for llama-quantize

| **Author** | `Nexesenex` |
| :--- | :--- |
| **State** | ✅ **Open** |
| **Created** | 2024-10-23 |
| **Updated** | 2024-10-23 |

---

#### Description
## 📄 Description

Hey IK.

@@ -1,13 +1,14 @@
### 🗣️ [#140](https://github.com/ikawrakow/ik_llama.cpp/discussions/140) - Questions about weight[j]
## 🗣️ [Discussion #140](https://github.com/ikawrakow/ik_llama.cpp/discussions/140) - Questions about weight[j]

| **Author** | `DavidZyy` |
| :--- | :--- |
| **State** | ✅ **Open** |
| **Created** | 2024-12-13 |
| **Updated** | 2025-02-11 |

---

#### Description
## 📄 Description

Hi @ikawrakow, your work on quantization is amazing and I really admire it. Recently I have been reading the code for this and have some questions.
For example, in the function `quantize_row_q4_0_impl` and other places, `weight[j]` is:
@@ -21,9 +22,9 @@ weight[j] = qw[j]
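
For readers without the source open, here is a sketch of the weighting expression being asked about, reconstructed from memory of mainline llama.cpp (the helper name and signature are mine, and the exact expression in the repository may differ):

```cpp
// Per-block importance weights used in the weighted least-squares fit of the
// block scale: qw[] are the imatrix (activation) weights, xb[] the block's
// model weights, and sigma2 the mean square value over the row.
#include <cmath>

static void compute_block_weights(const float * qw, const float * xb, float sigma2,
                                  float * weight, int block_size) {
    for (int j = 0; j < block_size; ++j) {
        weight[j] = qw[j] * std::sqrt(sigma2 + xb[j] * xb[j]);
    }
}
```
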

---

#### 🗣️ Discussion
## 💬 Discussion

👤 **ikawrakow** replied the **2024-12-14** at **08:13:19**:<br>
👤 **ikawrakow** commented on **2024-12-14** at **08:13:19**

Hi @DavidZyy,

@@ -40,15 +41,15 @@ Why the need for correcting the Hessian in the first place?

---

👤 **DavidZyy** replied the **2024-12-14** at **13:58:43**:<br>
👤 **DavidZyy** commented on **2024-12-14** at **13:58:43**

Thanks for taking the time to answer this question and share this information; I learned a lot from your answers.
Yes, it's very interesting :)
> (and it was amusing to observe people trying to make scientific sense out of it)

---

👤 **jukofyork** replied the **2025-02-10** at **17:03:34**:<br>
👤 **jukofyork** commented on **2025-02-10** at **17:03:34**

Oh shit, I just realised I totally forgot to reply to this post! @ikawrakow Thanks for the explanation!

@@ -66,15 +67,16 @@ but I still suspect that for these new very-high-expert-MoEs it should really be

---

👤 **ikawrakow** replied the **2025-02-10** at **18:07:55**:<br>
👤 **ikawrakow** commented on **2025-02-10** at **18:07:55**

@jukofyork So, I have used regularization in a variety of contexts. Sadly, having spent the better part of my career in Medical Device where everything is closed source, there aren't many examples of that in the open. [This repository](https://github.com/ikawrakow/mnist) uses Tikhonov regularization for the training of an SVM model to recognize handwritten digits. I put it out there because I find it funny that with fewer lines of code I can beat the [ggml mnist example](https://github.com/ggml-org/ggml/tree/master/examples/mnist) by a huge margin (0.4% vs 2% error rate, so 5X lower). But having used regularization techniques in deformable image registration, large scale optimization of radiation therapy treatments, real-time target and/or critical organ tracking on live MRI images, MR and PET image reconstruction, etc., I think I know quite well when regularization is required, and LLM quantization is not one of the cases where it is, at least not in the classical sense of adding penalty term(s) to the optimization objective. For instance, Tikhonov regularization, which was being proposed in one of the discussions, is pretty much the last thing we want to do when quantizing because we definitely do not want to make the quantized values as small as possible, which is the goal of the Tikhonov regularization term. At some level, one can consider i-quants as using "regularization" via forcing groups of quants to fall on a finite set of grid points, the set being much smaller than all possible grid points for the given number of bits per quant. E.g., `IQ2_XXS` uses 256 out of 6561 points on the E8 lattice. This prevents overfitting, thus can be considered as "regularization".
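
For concreteness, this is the generic form of the objective being argued against, in my own notation ($w_j$ importance weights, block scale $d$, quantized values $q_j$, penalty strength $\lambda$; none of this is code from the repository):

```latex
% Weighted least-squares quantization objective with a Tikhonov (ridge) penalty.
% The \lambda \sum_j q_j^2 term pulls the quantized values toward zero, which is
% exactly what we do not want: d q_j should track x_j as closely as possible.
\min_{d,\,q} \; \sum_j w_j \, (x_j - d\, q_j)^2 \; + \; \lambda \sum_j q_j^2
```
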

The other thing I have learned is that theories are rarely useful in their pure form. More often than not, you start with this beautiful theory, only to find that it does not work very well in practice. So, you start adding fudge factors, and things get better. And then you add even more fudge factors and it gets better. When you are done with it you have something that works really well, but you barely recognize the beautiful pure theory you started from.

Just my 2 cents

> 👤 **jukofyork** replied the **2025-02-10** at **19:26:00**:<br>
> 👤 **jukofyork** replied on **2025-02-10** at **19:26:00**
>
> > For instance, Tikhonov regularization that was being proposed in one of the discussions, is pretty much the last thing we want to do when quantizing because we definitely do not want to make the quantized values as small as possible, which is the goal of the Tikhonov regularization term.
>
> I was late to that discussion, but it was possibly me who mentioned this.
@@ -138,8 +140,9 @@ Just my 2 cents
> I am certainly no "Bayesian purist" and will happily tune the prior to get the best observed results too!
>
> BUT: I strongly believe the effectiveness of the `imatrix` calculations could be vastly improved by adding some method of interpolation/regularisation/whatever to allow for informed tuning of the weighting factors! :smile:
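
One simple form such regularisation could take, sketched here as a hypothetical helper (the shrink-toward-the-mean scheme and the `alpha` parameter are my own illustration, not anything implemented in the repository):

```cpp
// Shrink per-column imatrix weights toward their mean so that columns backed by
// few activation samples are not trusted too strongly. alpha = 0 keeps the raw
// weights, alpha = 1 makes them uniform.
#include <numeric>
#include <vector>

static std::vector<float> shrink_imatrix_weights(const std::vector<float> & qw, float alpha) {
    const float mean = std::accumulate(qw.begin(), qw.end(), 0.0f) / (float) qw.size();
    std::vector<float> out(qw.size());
    for (size_t j = 0; j < qw.size(); ++j) {
        out[j] = (1.0f - alpha) * qw[j] + alpha * mean;
    }
    return out;
}
```
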

> 👤 **saood06** replied on **2025-02-10** at **20:23:18**
>
> 👤 **saood06** replied the **2025-02-10** at **20:23:18**:<br>
> > I still think this is an important area to consider (whatever the chosen regularization method is):
> > #### (A) I see people still using bartowski's same ~250kb `calibration_datav3.txt` file on `Deepseek-V3` as on fully-dense models.
> >
@@ -159,8 +162,9 @@ Just my 2 cents
> From: https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/3#6758d52499eea0c4b65d0475
>
> They do discuss the idea of needing more data because of MoE in that thread. I use their imatrix.dat files, and the PPL numbers I gave you are for IQ4_K_R4.

> 👤 **ikawrakow** replied on **2025-02-11** at **06:01:32**
>
> 👤 **ikawrakow** replied the **2025-02-11** at **06:01:32**:<br>
> Is the inability to activate all experts observed just for layer 0 or for all layers?
>
> Are people aware of the fact that one can run the model with more active experts than specified by the metadata?
@@ -170,8 +174,9 @@ Just my 2 cents
> I think doing that will likely help activate more experts.
>
> I also don't understand why the entire experts tensor cannot be imatrix-quantized if just one expert is missing. If that's what we ended up with, it definitely needs fixing.

> 👤 **saood06** replied on **2025-02-11** at **15:17:30**
>
> 👤 **saood06** replied the **2025-02-11** at **15:17:30**:<br>
> > Is the inability to activate all experts observed just for layer 0 or for all layers?
>
> Just layer 0.
@@ -201,19 +206,22 @@ Just my 2 cents
> They never reported that for any of the Deepseek models, so I'm assuming they only encountered it with Arctic, and since no matter what they did they were never able to activate that expert, I'm giving some credence to their theory that "There indeed could be an issue in the model router that makes it impossible to ever get routed to this specific expert which would be really unfortunate."
>
> Looking at the safetensors files, each expert is stored separately, but with a GGUF that is not the case: they are all stored together.

> 👤 **ikawrakow** replied on **2025-02-11** at **16:33:38**
>
> 👤 **ikawrakow** replied the **2025-02-11** at **16:33:38**:<br>
> Thanks for making me aware of this situation. I prepared PR #202 to deal with it.
> Thanks for making me aware of this situation. I prepared PR [#202](https://github.com/ikawrakow/ik_llama.cpp/issues/202) to deal with it.

> 👤 **ikawrakow** replied on **2025-02-11** at **17:11:08**
>
> 👤 **ikawrakow** replied the **2025-02-11** at **17:11:08**:<br>
> > but are you sure that is recommended?
>
> I don't know if it is recommended. What I do know is that one can improve low bpw quantization by using a slightly higher number of active experts. E.g., for DeepSeek-Lite, 8 instead of 6 active experts is distinctly better for `IQ1_S` and `IQ1_M`. IIRC, 3 instead of 2 active experts did improve `IQ1_S` and `IQ1_M` quantized Mixtral8x7. As you increase the bpw the advantage goes away and eventually becomes counterproductive. Using 3 instead of 2 experts for Mixtral8x7 was futile at 4+ bpw. But these new models have way more experts and more active experts, so activating additional experts is more forgiving. A quick check with DeepSeek-Lite (6 active experts as per metadata):
> * For 7 experts PPL is slightly lower (-0.2%)
> * For 8 and 9 experts it is about the same
> * For 10 experts PPL is ~0.3% higher.

> 👤 **saood06** replied on **2025-02-11** at **17:27:49**
>
> 👤 **saood06** replied the **2025-02-11** at **17:27:49**:<br>
> With R1 I've come across a person saying "I tried with 10 and 12 experts and generating perplexity failed with NaNs", and the same person tested 2, 3, 4, 6, 8, and 16 experts with unsloth's IQ1_M. His results are below.
>
> Experts | PPL
@@ -235,8 +243,9 @@ Just my 2 cents
> IQ3_XXS (exp=4) | 2.87 | 3.61 | 2.60 | 2.25 | 2.09 | 1.97 | 1.89 | 1.87
> IQ3_XXS (exp=6) | 2.67 | 3.53 | 2.53 | 2.13 | 1.94 | 1.80 | 1.71 | 1.65
> IQ3_XXS (def) | 2.69 | 3.53 | 2.51 | 2.11 | 1.91 | 1.78 | 1.69 | 1.62

> 👤 **jukofyork** replied on **2025-02-11** at **19:22:47**
>
> 👤 **jukofyork** replied the **2025-02-11** at **19:22:47**:<br>
> > > but are you sure that is recommended?
> >
> > I don't know if it is recommended. What I do know is that one can improve low bpw quantization by using a slightly higher number of active experts. E.g., for DeepSeek-Lite, 8 instead of 6 active experts is distinctly better for `IQ1_S` and `IQ1_M`. IIRC, 3 instead of 2 active experts did improve `IQ1_S` and `IQ1_M` quantized Mixtral8x7. As you increase the bpw the advantage goes away and eventually becomes counterproductive. Using 3 instead of 2 experts for Mixtral8x7 was futile at 4+ bpw. But these new models have way more experts and more active experts, so activating additional experts is more forgiving. A quick check with DeepSeek-Lite (6 active experts as per metadata):
@@ -248,8 +257,9 @@ Just my 2 cents
> > * For 10 experts PPL is ~0.3% higher.
>
> Yeah, I managed to do this with `dbrx` before the PR that fixes the divisors for the experts separately. IIRC, I actually activated all the experts for `dbrx` and it got a better resulting `imatrix` than the pre-PR code did, and was quite usable.

> 👤 **jukofyork** replied on **2025-02-11** at **19:24:47**
>
> 👤 **jukofyork** replied the **2025-02-11** at **19:24:47**:<br>
> > With R1 I've come across a person saying "I tried with 10 and 12 experts and generating perplexity failed with NaNs", and the same person tested 2, 3, 4, 6, 8, and 16 experts with unsloth's IQ1_M. His results are below.
>
> This could be because most previous MoEs use softmax to gate/weight with, so as you add more experts it scales down the weights, but `deepseek-v3` uses sigmoids, so the sum getting added into the hidden state will get larger and larger (you can probably also hack the weights and bias to counter this though); see the numerical sketch after this reply.
@@ -260,13 +270,15 @@ Just my 2 cents
> INFO:hf-to-gguf:blk.11.exp_probs_b.bias, torch.float32 --> F32, shape = {256}
> INFO:hf-to-gguf:blk.11.ffn_gate_inp.weight, torch.bfloat16 --> F32, shape = {7168, 256}
> ```
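
A small numerical sketch of the point above (the logits are made up; this is my own illustration, not code from either repository): with softmax gating the selected weights always sum to at most 1, while plain sigmoid gating adds another value in (0, 1) for every extra expert, so the contribution to the hidden state keeps growing unless the weights are re-normalized.

```cpp
// Compare the total gate weight for k selected experts under softmax vs sigmoid.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> logits = {2.1f, 1.7f, 1.3f, 0.9f, 0.6f, 0.4f, 0.2f, 0.1f};
    std::sort(logits.rbegin(), logits.rend());            // descending, top-k = first k

    float z = 0.0f;
    for (float l : logits) z += std::exp(l);              // softmax normalizer

    for (int k = 2; k <= 8; k += 2) {
        float softmax_sum = 0.0f, sigmoid_sum = 0.0f;
        for (int i = 0; i < k; ++i) {
            softmax_sum += std::exp(logits[i]) / z;                   // bounded by 1
            sigmoid_sum += 1.0f / (1.0f + std::exp(-logits[i]));      // grows with k
        }
        std::printf("k=%d  softmax top-k sum=%.3f  sigmoid sum=%.3f\n",
                    k, softmax_sum, sigmoid_sum);
    }
}
```
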

> 👤 **saood06** replied on **2025-02-11** at **20:24:39**
>
> 👤 **saood06** replied the **2025-02-11** at **20:24:39**:<br>
> > `deepseek-v3` uses sigmoids, so the sum getting added into the hidden state will get larger and larger
>
> Then why do 16 experts work, but not 10/12?

> 👤 **jukofyork** replied on **2025-02-11** at **20:33:32**
>
> 👤 **jukofyork** replied the **2025-02-11** at **20:33:32**:<br>
> > > `deepseek-v3` uses sigmoids, so the sum getting added into the hidden state will get larger and larger
> >
> > Then why do 16 experts work, but not 10/12?