@@ -1,26 +1,27 @@
### 🗣️ [#100](https://github.com/ikawrakow/ik_llama.cpp/discussions/100) - New argument / env variable for GGML_SCHED_MAX_COPIES?
## 🗣️ [Discussion #100](https://github.com/ikawrakow/ik_llama.cpp/discussions/100) - New argument / env variable for GGML_SCHED_MAX_COPIES?

| **Author** | `Nexesenex` |
| :--- | :--- |
| **State** | ✅ **Open** |
| **Created** | 2024-10-21 |
| **Updated** | 2024-10-21 |

---

#### Description
## 📄 Description

@ikawrakow, could you add a CLI argument (or at least an env variable, which is much simpler I guess, but I'm failing to do it right) to set GGML_SCHED_MAX_COPIES without recompiling? It impacts VRAM occupation and performance, and it would be great to be able to set it conveniently for benchmarking and customized use.

---

#### 🗣️ Discussion
## 💬 Discussion

👤 **ikawrakow** replied the **2024-10-21** at **08:29:25**:<br>
👤 **ikawrakow** commented on **2024-10-21** at **08:29:25**

I haven't looked into this at all. What is it good for?

---

👤 **Nexesenex** replied the **2024-10-21** at **09:36:22**:<br>
👤 **Nexesenex** commented on **2024-10-21** at **09:36:22**

It's supposed to make inference faster on multi-GPU setups, I guess. Mainline sets it to 4; I set it to 1 because I didn't notice much improvement back in the day, but I did notice more VRAM consumption and GPU load.
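
For context, `GGML_SCHED_MAX_COPIES` is a compile-time constant that controls how many pipeline-parallel copies the ggml scheduler allocates, which is why changing it currently requires a rebuild. Below is a minimal sketch of the kind of runtime override being asked for, assuming a hypothetical helper (the function name, the environment-variable lookup, and the clamping range are mine, not existing code):

```cpp
// Hypothetical sketch: fall back to the compile-time default, but let an
// environment variable override it at scheduler-creation time.
#include <algorithm>
#include <cstdio>
#include <cstdlib>

#ifndef GGML_SCHED_MAX_COPIES
#define GGML_SCHED_MAX_COPIES 4   // mainline default; Nexesenex builds with 1
#endif

static int sched_max_copies_from_env() {
    int n = GGML_SCHED_MAX_COPIES;
    if (const char * env = std::getenv("GGML_SCHED_MAX_COPIES")) {
        n = std::atoi(env);       // accept the user's value if one is set
    }
    return std::clamp(n, 1, 8);   // keep it in a sane range
}

int main() {
    std::printf("scheduler pipeline copies: %d\n", sched_max_copies_from_env());
}
```
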
@@ -1,13 +1,14 @@
### 🗣️ [#104](https://github.com/ikawrakow/ik_llama.cpp/discussions/104) - Convenience improvements for llama-quantize
## 🗣️ [Discussion #104](https://github.com/ikawrakow/ik_llama.cpp/discussions/104) - Convenience improvements for llama-quantize

| **Author** | `Nexesenex` |
| :--- | :--- |
| **State** | ✅ **Open** |
| **Created** | 2024-10-23 |
| **Updated** | 2024-10-23 |

---

#### Description
## 📄 Description

Hey IK.

@@ -1,13 +1,14 @@
### 🗣️ [#140](https://github.com/ikawrakow/ik_llama.cpp/discussions/140) - Questions about weight[j]
## 🗣️ [Discussion #140](https://github.com/ikawrakow/ik_llama.cpp/discussions/140) - Questions about weight[j]

| **Author** | `DavidZyy` |
| :--- | :--- |
| **State** | ✅ **Open** |
| **Created** | 2024-12-13 |
| **Updated** | 2025-02-11 |

---

#### Description
## 📄 Description

Hi @ikawrakow, your work on quantization is amazing and I really admire it. Recently I have been reading the code for this and have some questions.
For example, in the function `quantize_row_q4_0_impl` and other places, `weight[j]` is:
@@ -21,9 +22,9 @@ weight[j] = qw[j]
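
For readers without the source open, here is a sketch of the weighting expression being asked about, reconstructed from memory of mainline llama.cpp (the helper name and signature are mine, and the exact expression in the repository may differ):

```cpp
// Per-block importance weights used in the weighted least-squares fit of the
// block scale: qw[] are the imatrix (activation) weights, xb[] the block's
// model weights, and sigma2 the mean square value over the row.
#include <cmath>

static void compute_block_weights(const float * qw, const float * xb, float sigma2,
                                  float * weight, int block_size) {
    for (int j = 0; j < block_size; ++j) {
        weight[j] = qw[j] * std::sqrt(sigma2 + xb[j] * xb[j]);
    }
}
```
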

---

#### 🗣️ Discussion
## 💬 Discussion

👤 **ikawrakow** replied the **2024-12-14** at **08:13:19**:<br>
👤 **ikawrakow** commented on **2024-12-14** at **08:13:19**

Hi @DavidZyy,

@@ -40,15 +41,15 @@ Why the need for correcting the Hessian in the first place?

---

👤 **DavidZyy** replied the **2024-12-14** at **13:58:43**:<br>
👤 **DavidZyy** commented on **2024-12-14** at **13:58:43**

Thanks for taking the time to answer this question and share this information; I learned a lot from your answers.
Yes, it's very interesting :)
> (and it was amusing to observe people trying to make scientific sense out of it)

---

👤 **jukofyork** replied the **2025-02-10** at **17:03:34**:<br>
👤 **jukofyork** commented on **2025-02-10** at **17:03:34**

Oh shit, I just realised I totally forgot to reply to this post! @ikawrakow Thanks for the explanation!

@@ -66,15 +67,16 @@ but I still suspect that for these new very-high-expert-MoEs it should really be

---

👤 **ikawrakow** replied the **2025-02-10** at **18:07:55**:<br>
👤 **ikawrakow** commented on **2025-02-10** at **18:07:55**

@jukofyork So, I have used regularization in a variety of contexts. Sadly, having spent the better part of my career in Medical Device where everything is closed source, there aren't many examples of that in the open. [This repository](https://github.com/ikawrakow/mnist) uses Tikhonov regularization for the training of an SVM model to recognize handwritten digits. I put it out there because I find it funny that with fewer lines of code I can beat the [ggml mnist example](https://github.com/ggml-org/ggml/tree/master/examples/mnist) by a huge margin (0.4% vs 2% error rate, so 5X lower). But having used regularization techniques in deformable image registration, large scale optimization of radiation therapy treatments, real-time target and/or critical organ tracking on live MRI images, MR and PET image reconstruction, etc., I think I know quite well when regularization is required, and LLM quantization is not one of the cases where it is, at least not in the classical sense of adding penalty term(s) to the optimization objective. For instance, Tikhonov regularization, which was being proposed in one of the discussions, is pretty much the last thing we want to do when quantizing because we definitely do not want to make the quantized values as small as possible, which is the goal of the Tikhonov regularization term. At some level, one can consider i-quants as using "regularization" via forcing groups of quants to fall on a finite set of grid points, the set being much smaller than all possible grid points for the given number of bits per quant. E.g., `IQ2_XXS` uses 256 out of 6561 points on the E8 lattice. This prevents overfitting, thus can be considered as "regularization".
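
For concreteness, this is the generic form of the objective being argued against, in my own notation ($w_j$ importance weights, block scale $d$, quantized values $q_j$, penalty strength $\lambda$; none of this is code from the repository):

```latex
% Weighted least-squares quantization objective with a Tikhonov (ridge) penalty.
% The \lambda \sum_j q_j^2 term pulls the quantized values toward zero, which is
% exactly what we do not want: d q_j should track x_j as closely as possible.
\min_{d,\,q} \; \sum_j w_j \, (x_j - d\, q_j)^2 \; + \; \lambda \sum_j q_j^2
```
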

The other thing I have learned is that theories are rarely useful in their pure form. More often than not, you start with this beautiful theory, only to find that it does not work very well in practice. So, you start adding fudge factors, and things get better. And then you add even more fudge factors and it gets better. When you are done with it you have something that works really well, but you barely recognize the beautiful pure theory you started from.

Just my 2 cents

> 👤 **jukofyork** replied the **2025-02-10** at **19:26:00**:<br>
> 👤 **jukofyork** replied on **2025-02-10** at **19:26:00**
>
> > For instance, Tikhonov regularization that was being proposed in one of the discussions, is pretty much the last thing we want to do when quantizing because we definitely do not want to make the quantized values as small as possible, which is the goal of the Tikhonov regularization term.
>
> I was late to that discussion, but it was possibly me who mentioned this.
@@ -138,8 +140,9 @@ Just my 2 cents
> I am certainly no "Bayesian purist" and will happily tune the prior to get the best observed results too!
>
> BUT: I strongly believe the effectiveness of the `imatrix` calculations could be vastly improved by adding some method of interpolation/regularisation/whatever to allow for informed tuning of the weighting factors! :smile:
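
One simple form such regularisation could take, sketched here as a hypothetical helper (the shrink-toward-the-mean scheme and the `alpha` parameter are my own illustration, not anything implemented in the repository):

```cpp
// Shrink per-column imatrix weights toward their mean so that columns backed by
// few activation samples are not trusted too strongly. alpha = 0 keeps the raw
// weights, alpha = 1 makes them uniform.
#include <numeric>
#include <vector>

static std::vector<float> shrink_imatrix_weights(const std::vector<float> & qw, float alpha) {
    const float mean = std::accumulate(qw.begin(), qw.end(), 0.0f) / (float) qw.size();
    std::vector<float> out(qw.size());
    for (size_t j = 0; j < qw.size(); ++j) {
        out[j] = (1.0f - alpha) * qw[j] + alpha * mean;
    }
    return out;
}
```
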

> 👤 **saood06** replied on **2025-02-10** at **20:23:18**
>
> 👤 **saood06** replied the **2025-02-10** at **20:23:18**:<br>
> > I still think this is an important area to consider (whatever the chosen regularization method is):
> > #### (A) I see people still using bartowski's same ~250kb `calibration_datav3.txt` file on `Deepseek-V3` as on fully-dense models.
> >
@@ -159,8 +162,9 @@ Just my 2 cents
> From: https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/3#6758d52499eea0c4b65d0475
>
> They do discuss the idea of needing more data because of MoE in that thread. I use their imatrix.dat files, and the PPL numbers I gave you are for IQ4_K_R4.

> 👤 **ikawrakow** replied on **2025-02-11** at **06:01:32**
>
> 👤 **ikawrakow** replied the **2025-02-11** at **06:01:32**:<br>
> Is the inability to activate all experts observed just for layer 0 or for all layers?
>
> Are people aware of the fact that one can run the model with more active experts than specified by the metadata?
@@ -170,8 +174,9 @@ Just my 2 cents
> I think doing that will likely help activate more experts.
>
> I also don't understand why the entire experts tensor cannot be imatrix-quantized if just one expert is missing. If that's what we ended up with, it definitely needs fixing.

> 👤 **saood06** replied on **2025-02-11** at **15:17:30**
>
> 👤 **saood06** replied the **2025-02-11** at **15:17:30**:<br>
> > Is the inability to activate all experts observed just for layer 0 or for all layers?
>
> Just layer 0.
@@ -201,19 +206,22 @@ Just my 2 cents
> They never reported that for any of the Deepseek models, so I'm assuming they only encountered it with Arctic, and since no matter what they did they were never able to activate that expert, I'm giving some credence to their theory that "There indeed could be an issue in the model router that makes it impossible to ever get routed to this specific expert which would be really unfortunate."
>
> Looking at the safetensors files, each expert is stored separately, but with a GGUF that is not the case: they are all stored together.

> 👤 **ikawrakow** replied on **2025-02-11** at **16:33:38**
>
> 👤 **ikawrakow** replied the **2025-02-11** at **16:33:38**:<br>
> Thanks for making me aware of this situation. I prepared PR #202 to deal with it.
> Thanks for making me aware of this situation. I prepared PR [#202](https://github.com/ikawrakow/ik_llama.cpp/issues/202) to deal with it.

> 👤 **ikawrakow** replied on **2025-02-11** at **17:11:08**
>
> 👤 **ikawrakow** replied the **2025-02-11** at **17:11:08**:<br>
> > but are you sure that is recommended?
>
> I don't know if it is recommended. What I do know is that one can improve low bpw quantization by using a slightly higher number of active experts. E.g., for DeepSeek-Lite, 8 instead of 6 active experts is distinctly better for `IQ1_S` and `IQ1_M`. IIRC, 3 instead of 2 active experts did improve `IQ1_S` and `IQ1_M` quantized Mixtral8x7. As you increase the bpw the advantage goes away and eventually becomes counterproductive. Using 3 instead of 2 experts for Mixtral8x7 was futile at 4+ bpw. But these new models have way more experts and more active experts, so activating additional experts is more forgiving. A quick check with DeepSeek-Lite (6 active experts as per metadata):
> * For 7 experts PPL is slightly lower (-0.2%)
> * For 8 and 9 experts it is about the same
> * For 10 experts PPL is ~0.3% higher.

> 👤 **saood06** replied on **2025-02-11** at **17:27:49**
>
> 👤 **saood06** replied the **2025-02-11** at **17:27:49**:<br>
> With R1 I've come across a person saying "I tried with 10 and 12 experts and generating perplexity failed with NaNs", and the same person tested 2, 3, 4, 6, 8, and 16 experts with unsloth's IQ1_M. His results are below.
>
> Experts | PPL
@@ -235,8 +243,9 @@ Just my 2 cents
> IQ3_XXS (exp=4) | 2.87 | 3.61 | 2.60 | 2.25 | 2.09 | 1.97 | 1.89 | 1.87
> IQ3_XXS (exp=6) | 2.67 | 3.53 | 2.53 | 2.13 | 1.94 | 1.80 | 1.71 | 1.65
> IQ3_XXS (def) | 2.69 | 3.53 | 2.51 | 2.11 | 1.91 | 1.78 | 1.69 | 1.62

> 👤 **jukofyork** replied on **2025-02-11** at **19:22:47**
>
> 👤 **jukofyork** replied the **2025-02-11** at **19:22:47**:<br>
> > > but are you sure that is recommended?
> >
> > I don't know if it is recommended. What I do know is that one can improve low bpw quantization by using a slightly higher number of active experts. E.g., for DeepSeek-Lite, 8 instead of 6 active experts is distinctly better for `IQ1_S` and `IQ1_M`. IIRC, 3 instead of 2 active experts did improve `IQ1_S` and `IQ1_M` quantized Mixtral8x7. As you increase the bpw the advantage goes away and eventually becomes counterproductive. Using 3 instead of 2 experts for Mixtral8x7 was futile at 4+ bpw. But these new models have way more experts and more active experts, so activating additional experts is more forgiving. A quick check with DeepSeek-Lite (6 active experts as per metadata):
@@ -248,8 +257,9 @@ Just my 2 cents
> > * For 10 experts PPL is ~0.3% higher.
>
> Yeah, I managed to do this with `dbrx` before the PR that fixes the divisors for the experts separately. IIRC, I actually activated all the experts for `dbrx` and it got a better resulting `imatrix` than the pre-PR code did, and was quite usable.

> 👤 **jukofyork** replied on **2025-02-11** at **19:24:47**
>
> 👤 **jukofyork** replied the **2025-02-11** at **19:24:47**:<br>
> > With R1 I've come across a person saying "I tried with 10 and 12 experts and generating perplexity failed with NaNs", and the same person tested 2, 3, 4, 6, 8, and 16 experts with unsloth's IQ1_M. His results are below.
>
> This could be because most previous MoEs use softmax to gate/weight with, so as you add more experts it scales down the weights, but `deepseek-v3` uses sigmoids, so the sum getting added into the hidden state will get larger and larger (you can probably also hack the weights and bias to counter this though); see the numerical sketch after this reply.
@@ -260,13 +270,15 @@ Just my 2 cents
> INFO:hf-to-gguf:blk.11.exp_probs_b.bias, torch.float32 --> F32, shape = {256}
> INFO:hf-to-gguf:blk.11.ffn_gate_inp.weight, torch.bfloat16 --> F32, shape = {7168, 256}
> ```
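
A small numerical sketch of the point above (the logits are made up; this is my own illustration, not code from either repository): with softmax gating the selected weights always sum to at most 1, while plain sigmoid gating adds another value in (0, 1) for every extra expert, so the contribution to the hidden state keeps growing unless the weights are re-normalized.

```cpp
// Compare the total gate weight for k selected experts under softmax vs sigmoid.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    std::vector<float> logits = {2.1f, 1.7f, 1.3f, 0.9f, 0.6f, 0.4f, 0.2f, 0.1f};
    std::sort(logits.rbegin(), logits.rend());            // descending, top-k = first k

    float z = 0.0f;
    for (float l : logits) z += std::exp(l);              // softmax normalizer

    for (int k = 2; k <= 8; k += 2) {
        float softmax_sum = 0.0f, sigmoid_sum = 0.0f;
        for (int i = 0; i < k; ++i) {
            softmax_sum += std::exp(logits[i]) / z;                   // bounded by 1
            sigmoid_sum += 1.0f / (1.0f + std::exp(-logits[i]));      // grows with k
        }
        std::printf("k=%d  softmax top-k sum=%.3f  sigmoid sum=%.3f\n",
                    k, softmax_sum, sigmoid_sum);
    }
}
```
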

> 👤 **saood06** replied on **2025-02-11** at **20:24:39**
>
> 👤 **saood06** replied the **2025-02-11** at **20:24:39**:<br>
> > `deepseek-v3` uses sigmoids, so the sum getting added into the hidden state will get larger and larger
>
> Then why do 16 experts work, but not 10/12?

> 👤 **jukofyork** replied on **2025-02-11** at **20:33:32**
>
> 👤 **jukofyork** replied the **2025-02-11** at **20:33:32**:<br>
> > > `deepseek-v3` uses sigmoids, so the sum getting added into the hidden state will get larger and larger
> >
> > Then why do 16 experts work, but not 10/12?